How To Crawl A Staging Website
Introduction To Crawling Staging Websites
When a website is in staging or development, it should be restricted from being crawled and indexed by search engines. This avoids it ranking in the search results, and potentially competing with and cannibalising the live website.
It’s essential to be able to audit sites in staging environments before they go live, which is where crawling them with the Screaming Frog SEO Spider can help.
Various methods are used to block search engines from a staging site, or to prevent its content from being indexed, including putting it behind a login, using robots.txt, noindex and more. Staging servers typically lack the performance of production, and websites still in development are generally more fragile as well.
This tutorial guides you through how to configure the Screaming Frog SEO Spider to crawl any staging server and website. There are a number of ‘gotchas’ that can catch out even experienced SEOs.
How To Crawl A Staging Server
Staging websites are usually restricted from being crawled by search engines and crawlers. There are various methods to prevent crawling, and each requires a slightly different approach or configuration to bypass.
If the website uses robots.txt to block it from being crawled then only a single URL will be returned in the SEO Spider.
A ‘Blocked by robots.txt’ message will be displayed in the ‘Status’ and ‘Indexability Status’ columns, with the indexability set to ‘Non-Indexable’.
To crawl the website, you will need to go to ‘Config > Robots.txt’ and choose ‘Ignore robots.txt’.
If the robots.txt file contains disallow directives that you wish the SEO Spider to obey, then use ‘custom robots’ via ‘Config > robots.txt’ to remove the global disallow and leave other directives in place.
This means you can mimic a crawl as it should be in a live environment.
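You can sanity-check a staging robots.txt outside the SEO Spider with Python’s standard library. The robots.txt content and hostname below are illustrative of a typical staging setup with a global disallow:

```python
from urllib.robotparser import RobotFileParser

# A typical staging robots.txt with a global disallow
staging_robots = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(staging_robots.splitlines())
parser.modified()  # mark the file as read so can_fetch() evaluates the rules

# Every URL is blocked, for every user-agent, which is why only a
# single URL is returned when crawling a site set up like this
print(parser.can_fetch("Googlebot", "https://staging.example.com/page"))  # False
```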
If you’re using the free version of the SEO Spider, where you don’t have access to the configuration, and your website is blocked from being crawled by robots.txt, you can use an ‘Allow’ directive in the robots.txt for the ‘Screaming Frog SEO Spider’ user-agent to get around it.
User-agent: Screaming Frog SEO Spider
Allow: /
The SEO Spider will then follow the allow directive, while all other bots will remain blocked.
This is our recommended approach for staging websites, as it means the search engines can’t crawl or index the URLs.
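To illustrate how user-agent matching plays out, Python’s built-in robotparser can confirm that an allow directive for the ‘Screaming Frog SEO Spider’ user-agent lets it through while other crawlers stay blocked (the hostname below is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# robots.txt that blocks all crawlers but allows the SEO Spider's user-agent
robots_txt = """\
User-agent: Screaming Frog SEO Spider
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
parser.modified()  # mark the file as read so can_fetch() evaluates the rules

print(parser.can_fetch("Screaming Frog SEO Spider", "https://staging.example.com/"))  # True
print(parser.can_fetch("Googlebot", "https://staging.example.com/"))                  # False
```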
If the server requires a username and password to access it, then you’ll need to provide it to the SEO Spider to crawl the site. There’s a couple of main types of authentication, which have slightly different set-ups.
The most common is basic and digest server authentication, which you can see in a browser when you visit the website and a pop-up appears requesting a username and password.
If the login screen is contained in the page itself, this will be a web form authentication. More on both types below.
Basic & Digest Authentication
Basic and digest authentication is detected automatically when you crawl the website.
Often sites in development will also be blocked via robots.txt, so ensure you’ve followed our guidance on robots.txt above so it can be crawled.
Crawl the staging website and an authentication pop-up box will appear, just like it does in a web browser – asking for a username and password.
Enter your credentials, and the crawl will continue as normal. You cannot pre-enter login credentials – they are entered when URLs that require authentication are crawled. This feature does not require a licence.
Try the following pages to see how authentication works in your browser, or in the SEO Spider.
- Basic Authentication – Username: user Password: password
- Digest Authentication – Username: user Password: password
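Under the hood, basic authentication simply sends a base64-encoded ‘username:password’ pair in an ‘Authorization’ HTTP header with each request. A minimal Python sketch using the demo credentials above shows what the header looks like:

```python
import base64

# Basic auth sends the credentials base64-encoded in an HTTP header.
# Using the demo credentials above (user / password):
username, password = "user", "password"
token = base64.b64encode(f"{username}:{password}".encode()).decode()
auth_header = f"Basic {token}"

print(auth_header)  # Basic dXNlcjpwYXNzd29yZA==
```

Note the credentials are only encoded, not encrypted, which is why basic auth should always be used over HTTPS.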
Web Form Authentication
There are other web forms and areas which require you to log in with cookies for authentication to be able to view or crawl them. The SEO Spider allows users to log in to these web forms within its built-in Chrome browser, and then crawl the site. This feature requires a licence.
To log in using web forms authentication, click ‘Configuration > Authentication > Forms Based’. Then click the ‘Add’ button, enter the URL for the site you want to crawl, and a browser will pop up allowing you to log in.
Use the browser window to log in as normal, then click ‘OK’, and ‘OK’ again. This provides the necessary cookies to the SEO Spider and you can start your crawl as usual.
This feature is powerful because it provides a way to set cookies in the SEO Spider, so it can also be used for scenarios such as bypassing geo IP redirection, or if a site is using bot protection with reCAPTCHA or the like.
Please read our tutorial on crawling web form password protected sites.
Some staging platforms can restrict access by IP address.
As the SEO Spider crawls locally from the machine it’s run from, you’ll need to ask for this IP address to be included on the ‘allowlist’ of the server or platform used by the site (sometimes historically referred to as a ‘whitelist’).
If this is your own machine, you can find your IP by Googling ‘my IP address‘ and it will display at the top of the SERP.
Less Common Methods
While most staging websites are restricted by robots.txt or authentication, we also sometimes see various other set-ups, which are covered below.
We have seen some test areas of a website only display an updated page when a specific cookie is supplied. This is often not on staging servers, but on full production websites where changes are being tested in limited form.
These alternative pages can be accessed by providing the required cookie in the SEO Spider’s request using custom HTTP Headers.
In the custom HTTP headers configuration, click ‘Add’, enter ‘Cookie’ as the header name and provide the cookie in the ‘Header Value’ field.
If a name and value pair are required, you can enter them separated by an equals sign (‘name=value’) in that field.
You can then crawl the website and the relevant cookie will be supplied with each request made by the SEO Spider.
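The same idea can be sketched with Python’s standard library — a hypothetical ‘preview=true’ cookie attached as a custom ‘Cookie’ header on a request (the URL and cookie name are placeholders for illustration; no request is actually sent here):

```python
import urllib.request

# Attach a hypothetical name=value cookie as a custom HTTP header,
# mirroring the SEO Spider's custom header configuration
request = urllib.request.Request(
    "https://staging.example.com/",  # placeholder staging URL
    headers={"Cookie": "preview=true"},
)

# The header is now sent with every request made via this Request object
print(request.get_header("Cookie"))  # preview=true
```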
Some new websites are only visible initially via modifying a hosts file.
If you modify your own hosts file to view the website, then the SEO Spider will also be able to see the new site if you’re crawling locally from the same machine.
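As an illustration (with a documentation-range IP and placeholder hostname), a hosts file entry simply maps the new site’s hostname to its server’s IP address, bypassing public DNS:

```
# /etc/hosts (Linux/macOS) or C:\Windows\System32\drivers\etc\hosts (Windows)
203.0.113.10    staging.example.com
```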
How To Configure Settings To Crawl Staging Sites
Sites in development can respond to HTTP requests differently to those in a live environment, and can often have robots directives that require additional configuration in the SEO Spider.
Websites in staging are generally slower and more fragile than those in production. They often can’t cope with the same load as a production server, and the site is work in progress after all.
The default 5 threads used by the SEO Spider should not generally cause instability. However, we recommend speaking to developers prior to crawling, confirming an acceptable crawl rate if required, and then monitoring crawl responses and speed in the early stages of the crawl.
If you start to see connection timeouts, server errors or the crawl speed is just very slow, you may need to reduce the crawl speed. Speed can be adjusted via ‘Config > Speed’.
If you continue to see issues, reduce the crawl speed further. You can re-crawl URLs with no responses or server errors in bulk via a right-click ‘Re-Spider’.
Often development websites will have a sitewide ‘nofollow’ meta robots tag, or X-Robots-Tag in the HTTP Header. This is usually copied in without much thought alongside a ‘noindex’, as a ‘noindex, nofollow’.
A ‘nofollow’ is a very different directive to a noindex, and instructs a crawler not to follow any outlinks from a page.
You can see whether a page has a ‘nofollow’ under the ‘Directives’ tab and ‘nofollow’ filter.
By default the SEO Spider will obey these instructions, so if they are on all pages of the website, only a single page will be crawled.
If that’s the case, go to ‘Config > Spider’ and enable ‘Follow Internal Nofollow’ to crawl outlinks from these pages.
If you’d like to discover external links as well, tick the option below too.
Sometimes sites in staging use a ‘noindex’ instead of blocking crawling of the website, or in combination with it. A ‘noindex’ doesn’t prevent crawling, but it does instruct search engines not to index the pages.
The use of noindex can be seen under the ‘Directives’ tab and ‘noindex’ filter.
While the SEO Spider will crawl pages with a noindex, it will see these pages as ‘non-indexable’. This means they won’t be considered for issues discovered in filters, for things like duplicate or missing page titles, meta descriptions etc.
Therefore, we recommend disabling ‘Ignore Non-Indexable URLs for Issues’ when a sitewide ‘noindex’ is present. This can be found in ‘Config > Spider > Advanced’.
This will mean URLs with ‘noindex’ on them will be considered for any on-page issues.
This is one that can easily catch you out: the directive ‘none’ does not mean there are no directives present. A ‘none’ directive is actually the equivalent of ‘noindex, nofollow’.
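For example, these two meta robots tags mean exactly the same thing:

```
<meta name="robots" content="none">
<!-- is equivalent to -->
<meta name="robots" content="noindex, nofollow">
```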
How To Compare Staging Vs Live
The SEO Spider allows you to compare two crawls to see the differences. It has a unique ‘URL Mapping’ feature, that enables two different URL structures to be compared, such as a staging website against a production or live environment. You can compare entirely different hostnames, directories, or more subtle changes to URLs.
To compare staging against the live website, click ‘Mode > Compare’ and select the two crawls.
Then click on the compare configuration (‘Config > Compare’) and ‘URL Mapping’.
Input a regex to map the previous crawl URLs to the current crawl. Often it’s as simple as just mapping the hostname.
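The regex idea itself can be illustrated in Python — swapping a hypothetical staging hostname for the live one, so each staging URL maps onto its live equivalent (both hostnames below are placeholders):

```python
import re

# Map a staging URL to its live equivalent by swapping the hostname.
# Hostnames here are placeholders for illustration.
staging_url = "https://staging.example.com/blog/post-1/"
live_url = re.sub(r"^https://staging\.example\.com", "https://www.example.com", staging_url)

print(live_url)  # https://www.example.com/blog/post-1/
```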
The staging and existing live site URLs are then mapped to one another, so the equivalent URLs are compared against each other for overview tab data, issues, and opportunities, the site structure tab, and change detection.
Find out more in our tutorial on How To Compare Crawls.
This tutorial will hopefully help you crawl any staging server, and audit the development site.
If you experience any issues crawling a website after following the guidance above, check out the following FAQs –
- Why won’t the SEO Spider crawl my website?
- Why is the SEO Spider not finding a particular page or set of pages?
- HTTP Status Codes – Why Won’t My Website Crawl?
Alternatively, please contact us via support and we can help.