HTTP Status Codes – Why Won’t My Website Crawl?


When a website does not crawl as expected and only a single URL is returned, the ‘Status’ and ‘Status Code’ columns are the first things to check to help identify the issue.

A status is part of the Hypertext Transfer Protocol (HTTP) and is found in the server response header. It is made up of a numerical status code and an equivalent text status.

When a URL is entered into the Screaming Frog SEO Spider and a crawl is initiated, the numerical status of the URL from the response header is shown in the ‘Status Code’ column, while the text equivalent is shown in the ‘Status’ column within the default ‘Internal’ tab view.
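For illustration, the status line of a raw HTTP response carries both parts together; the headers below are a typical example rather than output from any particular server:

  HTTP/1.1 200 OK
  Content-Type: text/html; charset=UTF-8

Here ‘200’ would appear in the ‘Status Code’ column and ‘OK’ in the ‘Status’ column.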


The most common status codes you are likely to encounter when a site cannot be crawled, and the steps to troubleshoot them, can be found below:

 

Status Code – Status

0 – Blocked By Robots.txt
0 – DNS Lookup Failed
0 – Connection Timeout
0 – Connection Refused
0 – Connection Error / 0 – No Response
200 – OK
301 – Moved Permanently / 302 – Moved Temporarily
403 – Forbidden
404 – Page Not Found / 410 – Removed
500 / 502 / 503 – Internal Server Error

 

0 – Blocked by robots.txt

This isn’t technically a valid HTTP response; it’s how the SEO Spider indicates that no HTTP response was actually received, because the site’s robots.txt disallows the SEO Spider’s configured user agent from accessing the requested URL.


Things to check: What is being disallowed in the site’s robots.txt? (Append /robots.txt to the subdomain of the URL being crawled to view it.)

Things to try: Set the SEO Spider to ignore robots.txt (Configuration > Robots.txt > Settings > Ignore Robots.txt) or use the custom robots.txt configuration to allow crawling.

Reason: The SEO Spider obeys disallow robots.txt directives by default.
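As a simple illustration, a robots.txt containing the following rules would cause every URL on the site to be reported with this status, because it disallows all user agents (including the SEO Spider) from every path:

  User-agent: *
  Disallow: /

A site may also disallow specific user agents only, so it is worth checking for any rules that target the user agent the SEO Spider is configured to use.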

 

0 – DNS Lookup Failed

The website is not being found at all, often because the site does not exist or because your internet connection is down.


Things to check: The domain is being entered correctly.

Things to check: The site can be seen in your browser.

Reason: If the site can’t be viewed in your browser either, you could be experiencing PC / network connectivity issues.

 

0 – Connection Timeout

A connection timeout occurs when the SEO Spider does not receive an HTTP response from the server within a set amount of time (20 seconds by default).


Things to check: Can you view the site in a browser, does it load slowly?

Things to try: If the site is slow, try increasing the response timeout and lowering the speed of the crawl.

Reason: This gives the SEO Spider more time to receive information and puts less strain on the server.

 

Things to check: Can other sites be crawled? (bbc.co.uk and screamingfrog.co.uk are good control tests).

Things to try: Setting up exceptions for the SEO Spider in firewall / antivirus software (please consult your IT team).

Reason: If this issue occurs for every site, then it is likely an issue local to you or your PC / network.

 

Things to check: Is the proxy enabled? (Configuration > Proxy).

Things to try: If enabled, disable the proxy.

Reason: If the proxy is not set up correctly, the SEO Spider may not be sending or receiving requests properly.

 

Things to check: IPv6 Connectivity.

Things to try: Add the following line to the ScreamingFrogSEOSpider.l4j.ini file to encourage the use of IPv4:

-Djava.net.preferIPv4Stack=true

Reason: It could be that your machine’s configuration allows DNS resolution over IPv6 but the machine has no IPv6 connectivity.
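For example, after the change the ScreamingFrogSEOSpider.l4j.ini file might look something like this (the memory setting shown is purely illustrative; keep whatever lines are already in your file and simply add the new flag on its own line):

  -Xmx2g
  -Djava.net.preferIPv4Stack=true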

 

0 – Connection Refused

A ‘Connection Refused’ status is returned when the SEO Spider’s connection attempt has been refused at some point between the local machine and the website.


Things to check: Can you crawl other sites? (bbc.co.uk and screamingfrog.co.uk are good control tests).

Things to try: Setting up exceptions for the SEO Spider in firewall/antivirus software (please consult your IT team).

Reason: If this issue occurs for every site, then it is likely an issue local to you or your PC / network.

 

Things to check: Can you view the page in the browser or does it return a similar error?

Things to try: If the page can be viewed, set Googlebot or Chrome as the user agent (Configuration > HTTP Header > User-Agent).

Reason: The server is refusing the SEO Spider’s request of the page (possibly as protection/security against unknown user-agents).

 

Things to check: Is the site HTTPS?

Things to try: If so, install the relevant Java security fix. Please note, this is not necessary for version 8.0 onwards.

Reason: If the site uses stronger crypto algorithms than are supported by default in Java, the connection will be refused.

 

0 – Connection Error / 0 – No Response

The SEO Spider is having trouble making connections or receiving responses.


Things to check: Proxy Settings (Configuration > Proxy).

Things to try: If enabled, disable the proxy.

Reason: If not set up correctly then this might mean the SEO Spider is not sending/receiving requests properly.

 

Things to check: Can you view the page in the browser or does it return a similar error?

Reason: If there are issues with the network or site, the browser would likely have a similar issue.

 

200 – OK

There was no issue receiving a response from the server, so the problem must be with the content that was returned.


Things to check: Does the requested page have a meta robots ‘nofollow’ directive on the page / in the HTTP header, or do all the links on the page have rel=’nofollow’ attributes?

Things to try: Set the configuration to follow Internal/External Nofollow (Configuration > Spider).

Reason: By default the SEO Spider obeys ‘nofollow’ directives.
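For reference, the page-level and link-level ‘nofollow’ signals mentioned above look like this in the HTML (the URL is a placeholder), while the HTTP header equivalent is ‘X-Robots-Tag: nofollow’:

  <!-- Page-level directive in the <head> -->
  <meta name="robots" content="nofollow">

  <!-- Link-level attribute -->
  <a href="/page/" rel="nofollow">Page</a>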

 

Things to check: Are the links JavaScript-generated? (View the page in a browser with JavaScript disabled.)

Things to try: Enable JavaScript Rendering (Configuration > Spider > Rendering > JavaScript). For more details on JavaScript crawling, please see our JavaScript Crawling Guide.

Reason: By default the SEO Spider will only crawl <a href="">, <img src=""> and <link rel="canonical"> links in the HTML source code; it does not read the DOM. If available, the SEO Spider will use Google’s deprecated AJAX crawling scheme, which essentially means crawling an HTML snapshot of the rendered JavaScript page, instead of the JavaScript version of the page.
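As a simplified illustration, the first link below exists in the raw HTML source and is crawled by default, while the second is only created when JavaScript runs and is therefore only discovered with JavaScript rendering enabled (the element ID and URL are made up for the example):

  <!-- Present in the HTML source: crawled by default -->
  <a href="/category/">Category</a>

  <!-- Injected by JavaScript: requires rendering mode -->
  <script>
    document.getElementById('nav').innerHTML = '<a href="/category/">Category</a>';
  </script>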

 

Things to check: ‘Limits’ tab of ‘Configuration > Spider’ particularly ‘Limit Search Depth’ and ‘Limit Search Total’.

Reason: If these are set to 0 or 1 respectively, then the SEO Spider is being instructed to crawl only a single URL.

 

Things to check: Does the site require cookies? (View page in browser with cookies disabled).

Things to try: Configuration > Spider > Advanced Tab > Allow Cookies.

Reason: If cookies are disabled, the SEO Spider may be served a separate message or page that does not link to other pages on the site.

 

Things to try: Change the user agent to Googlebot (Configuration > HTTP Header > User-Agent).

Reason: The site/server may be set up to serve the HTML to search bots without requiring cookies to be accepted.

 

Things to check: What is specified in the ‘Content’ Column?

Things to try: If this is blank, enable JavaScript Rendering (Configuration > Spider > Rendering > JavaScript) and retry the crawl.

Reason: If no content type is specified in the HTTP header, the SEO Spider does not know whether the URL is an image, PDF, HTML page, etc., so it cannot crawl it to determine whether there are any further links. This can be bypassed with rendering mode, as the SEO Spider then checks whether a <meta http-equiv> is specified in the <head> of the document.
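For example, a page served without a Content-Type HTTP header can still declare its type within the document itself, which rendering mode is able to pick up:

  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>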

 

Things to check: Is there an age gate?

Things to try: Change the user agent to Googlebot (Configuration > HTTP Header > User-Agent).

Reason: The site/server may be set up to serve the HTML to search bots without requiring an age to be entered.

 

301 – Moved Permanently / 302 – Moved Temporarily

This means the requested URL has moved and been redirected to a different location.


Things to check: What is the redirect destination? (Check the outlinks of the returned URL).

Things to try: If this is the same as the starting URL, follow the steps described in our why do URLs redirect to themselves FAQ.

Reason: The redirect is in a loop where the SEO Spider never gets to a crawlable HTML page. If this is due to a cookie being dropped, this can be bypassed by following the steps in the FAQ linked above.
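As a rough illustration, a cookie-based redirect loop often looks something like this in the raw response (the header values are placeholders); the server sets a cookie and redirects, but because the cookie is never accepted, the same redirect is served on every subsequent request:

  HTTP/1.1 302 Moved Temporarily
  Location: https://www.example.com/page
  Set-Cookie: accepted=true; Path=/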

 

Things to check: External Tab.

Things to try: Configuration > Spider > Crawl All Subdomains.

Reason: The SEO Spider treats different subdomains as external and will not crawl them by default. If you are trying to crawl a subdomain that redirects to a different subdomain, it will be reported in the external tab.

 

Things to check: Does the site require cookies? (View the page in a browser with cookies disabled).

Things to try: Configuration > Spider > Advanced Tab > Allow Cookies.

Reason: The SEO Spider is being redirected to a URL where a cookie is dropped, but the SEO Spider does not accept cookies by default.

 

403 – Forbidden

The server is denying the SEO Spider’s request to view the requested URL.


Things to check: Can you view the page in a browser or does it return a similar error?

Things to try: If the page can be viewed, set Googlebot or Chrome as the user agent (Configuration > HTTP Header > User-Agent).

Reason: The site is denying the SEO Spider’s request of the page (possibly as protection/security against unknown user agents).

 

404 – Page Not Found / 410 – Removed

The server is indicating that the page cannot be found (404) or has been removed (410).


Things to check: Does the requested URL load a normal page in the browser?

Things to try: Check whether the status code is the same in other tools (Websniffer, Rexswain, Fetch as Google, etc.).

Reason: If every tool reports the same error status code despite the page loading normally, the site/server may be misconfigured and serving the error response code even though the page exists.
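If you have command-line access, curl is another quick way to compare responses; the command below sends a HEAD request with a simplified Googlebot user agent string (the URL is a placeholder, and note that some servers respond differently to HEAD than to GET):

  curl -I -A "Googlebot" https://www.example.com/page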

 

Things to try: If the page loads, try setting Googlebot or Chrome as the user agent (Configuration > HTTP Header > User-Agent).

Reason: The site is serving the error response to the SEO Spider (possibly as protection/security against unknown user agents).

 

500 / 502 / 503 – Internal Server Error

The server is saying that it has a problem.


Things to check: Can you view your site in the browser or is it down?

Things to try: If the site is up then try Googlebot or Chrome as the user agent (Configuration > HTTP Header > User-Agent).

Reason: The site is serving the server error to the SEO Spider (possibly as protection/security against unknown user agents).

 

It is possible for more than one of these issues to be present on the same page, for example, a JavaScript page could also have a meta ‘nofollow’ tag.

There are many more response codes than those listed here, but in our experience they are encountered infrequently, if at all. Many of them are likely to be resolved by following the same steps as the similar response codes described above.

More details on response codes can be found at https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
