How To Crawl Large Websites Using The SEO Spider

Crawling websites and collecting data is a memory intensive process, and the more you crawl, the more memory is required to store and process the data.

The Screaming Frog SEO Spider currently uses RAM, rather than your hard disk, to store and process data. This gives it some amazing advantages: it makes it lightning fast and super flexible, and you can view, sort and search data in real time, all while still crawling a website.

You’re ‘in the crawl’, rather than detached from it and simply ‘delivered a report’ at the end, once data has finished being pumped into a database and reports have been generated in the background ready to view. You get instant feedback, meaning you can see data immediately, and control and adjust the crawl throughout.

However, there are some negatives to this approach, as machines have less RAM than hard disk space. This means the SEO Spider is generally better suited to crawling websites under 500k URLs. However, users are able to crawl more than this with the right set-up, depending on how memory intensive the website being crawled is. As a very rough guide, a 64-bit machine with 8GB of RAM will generally allow you to crawl a couple of hundred thousand URLs.

When considering scale, it’s not just the number of unique pages or the data collected that needs to be considered, but also the internal linking of the website. The SEO Spider records every single inlink and outlink (and resource), which means a 100k page website with 100 sitewide links on every page actually means recording more like 10m links.
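As a back-of-envelope illustration using just the example figures above (these are the numbers from the paragraph, not measurements from a real site), the number of link records grows with pages multiplied by links per page:

    # Rough estimate of link records stored during a crawl.
    # Figures are the example numbers from above, not real measurements.
    pages = 100_000          # unique pages on the site
    links_per_page = 100     # sitewide links on every page
    link_records = pages * links_per_page
    print(f"{link_records:,} link records")  # 10,000,000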

Do You Really Need To Crawl The Whole Site?

This is the question we always recommend asking. Do you need to crawl every URL to get the data you need?

Advanced SEOs know that often it’s just not required. Generally websites are templated, and a sample crawl of page types from across various sections will be enough to make informed decisions across the wider site.

So, why crawl 5m URLs, when 50k is enough? With a few simple adjustments, you can avoid wasting resources and time on URLs you don’t need (more on adjusting the crawl shortly).

It’s worth remembering that crawling large sites not only takes up resources, but also a lot of time (and, for some solutions, cost). A 1 million page website at an average crawl rate of 5 URLs per second will take over two days to crawl. You could crawl faster, but most websites and servers don’t want to be crawled faster than that kind of speed.
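To make the two-day figure concrete, here’s the simple arithmetic behind it, using the example numbers above:

    # Crawl time for the example above: 1m URLs at 5 URLs per second.
    urls = 1_000_000
    urls_per_second = 5
    seconds = urls / urls_per_second   # 200,000 seconds
    hours = seconds / 3600             # ~55.6 hours
    days = hours / 24                  # ~2.3 days
    print(f"{hours:.1f} hours, or about {days:.1f} days")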

However, with the above said, there are times where a complete crawl is essential. You may need to crawl a large website in its entirety, or perhaps the website is on an enterprise level with 50m pages, and you need to crawl more to even get an accurate sample. In these scenarios, we recommend the following approach to crawling larger websites.

1) Increase Memory Allocation

The SEO Spider as standard allocates just 512MB of RAM, which will generally allow you to crawl between 10k and 50k URLs of a website. So we usually recommend a minimum of 8GB of RAM to crawl larger websites with a couple of hundred thousand pages. But the more RAM you have, the better!

When you reach the limit of memory allocation, you will receive the following warning.

SEO Spider out of memory warning

This is warning you that the SEO Spider has reached the current memory allocation and it needs to be increased to crawl more URLs, or it will become unstable. To increase memory, first of all you should save the crawl via the ‘File > Save’ menu.

You can then follow the instructions in our guides below on increasing your memory allocation, before opening the saved crawl and resuming it again. We generally recommend allocating 1GB less than your total RAM available. The SEO Spider will only use the memory when required and this just means you have the maximum available to you if and when you need it.

  1. Increasing Memory On Windows.
  2. Increasing Memory on Mac OS X.
  3. Increasing Memory on Mac OS X 7.2 or Earlier.
  4. Increasing Memory on Ubuntu.
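As an illustration of what the Windows guide above involves (treat this as a sketch, since exact file names, locations and defaults vary by version and platform, and the guides above are the authoritative steps), increasing the allocation has typically meant raising the -Xmx Java heap value in the SEO Spider’s ScreamingFrogSEOSpider.l4j.ini configuration file, for example replacing the default 512MB line with:

    -Xmx8g

The same principle applies on the other platforms: you’re raising the maximum heap the application is allowed to use, ideally to around 1GB less than the machine’s total RAM, as recommended above.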

You can then check to ensure your memory allocation has increased.

The more memory you are able to allocate, the more you’ll be able to crawl. So if you don’t have a machine with much RAM available, we recommend using a more powerful machine, or upgrading the amount of RAM.

2) Adjust What To Crawl In The Configuration

The more data that’s collected and the more that’s crawled, the more memory intensive it will be. So you can consider options for reducing memory consumption for a ‘lighter’ crawl.

Deselecting the following options under ‘Configuration > Spider’ will help save memory –

  1. Check Images.
  2. Check CSS.
  3. Check JavaScript.
  4. Check SWF.
  5. Check External Links.

You can also deselect the following crawl options under ‘Configuration > Spider’ to help save memory –

  1. Crawl Canonicals – Deselecting this only stops canonical URLs being crawled; they will still be extracted.
  2. Crawl Next/Prev – Deselecting this only stops next/prev URLs being crawled; they will still be extracted.
  3. Extract Hreflang – Deselecting this means hreflang URLs will not be extracted at all.
  4. Crawl Hreflang – Deselecting this means hreflang URLs will not be crawled.

There are also other options that will use memory if utilised, so consider whether you really need the following features –

  1. Custom Search.
  2. Custom Extraction.
  3. Google Analytics Integration.
  4. Google Search Console Integration.
  5. Link Metrics Integration (Majestic, Ahrefs and Moz).

This means less data, less crawling and lower memory consumption.

3) Exclude Unnecessary URLs

Use the exclude feature to avoid crawling unnecessary URLs. These might include entire sections, faceted navigation filters, particular URL parameters, or infinite URLs with repeating directories etc.

The exclude feature allows you to exclude URLs from a crawl completely, by supplying a list of regular expressions (regex). A URL that matches an exclude is not crawled at all (it’s not just ‘hidden’ in the interface). It’s also worth bearing in mind that URLs which don’t match the exclude, but can only be reached from an excluded page, will also not be crawled. So use the exclude with care.

We recommend starting a crawl, ordering the URLs in the ‘Internal’ tab alphabetically, and analysing them for patterns as they come in. Generally, by scrolling through the list in real time, you can put together a list of URL patterns for exclusion.

For example, ecommerce sites often have faceted navigations which allow users to filter and sort, which can result in lots of URLs. Often the same facets can be selected in a different order, resulting in a huge or effectively endless number of URLs.

Let’s take a real-life scenario, like John Lewis. If you crawl the site with standard settings, then due to their numerous facets, you can easily crawl filtered pages, such as the example below.

Selecting these facets generates URLs such as –

https://www.johnlewis.com/browse/men/mens-trousers/adidas/allsaints/gant-rugger/hymn/kin-by-john-lewis/selected-femme/homme/size=36r/_/N-ebiZ1z13yvxZ1z0g0g6Z1z04nruZ1z0s0laZ1z13u9cZ1z01kl3Z1z0swk1

This URL has multiple brands, a trouser size and a delivery option selected. There are also facets for colour, trouser fit and more! The number of different combinations that could be selected is virtually endless, and these should be considered for exclusion.

By ordering URLs in the ‘Internal’ tab alphabetically, it’s easy to spot URL patterns like these for potential exclusion. We can also see that URLs from the facets on John Lewis are set to ‘noindex’ anyway. Hence, we can simply exclude them from being crawled.

noindex john lewis size pages

Once you have a sample of URLs and have identified the issue, it’s generally not necessary to then crawl every facet and combination. They may also already be canonicalised, disallowed or noindexed, so you know they have already been ‘fixed’, and they can simply be excluded.
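As a purely hypothetical illustration based on the facet URL above (the right patterns depend entirely on the site, so these are examples of the approach rather than a recommended list), the exclude configuration takes one regex per line, and patterns like these would catch the size facet and the ‘/_/N-’ style facet URLs:

    .*size=.*
    .*/_/N-.*

The patterns are wrapped in .* so they match anywhere in the URL, in the same way as the include examples later in this guide.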

4) Crawl In Sections (Subdomain or Subfolders)

If the website is very large, you can consider crawling it in sections. By default, the SEO Spider will crawl just the subdomain entered, and all other subdomains encountered will be treated as external (and appear under the ‘external’ tab). You can choose to crawl all subdomains, but obviously this will take up more memory.

The SEO Spider can also be configured to crawl a subfolder, by simply entering the subfolder URI with its path and ensuring ‘check links outside of start folder’ and ‘crawl outside of start folder’ are deselected under ‘Configuration > Spider’. For example, to crawl our blog, you’d then simply enter https://www.screamingfrog.co.uk/blog/ and hit start.

Screaming Frog blog crawling subfolder

Please note that if there isn’t a trailing slash on the end of the subfolder, for example ‘/blog’ instead of ‘/blog/’, the SEO Spider won’t currently recognise it as a subfolder and crawl within it. If the trailing slash version of a subfolder redirects to a non-trailing slash version, then the same applies.

To crawl this sub folder, you’ll need to use the include feature and input the regex of that sub folder (.*blog.* in this example).

5) Narrow The Crawl, By Using The Include

You can use the include feature to control which URL path the SEO Spider will crawl via regex. It narrows the default search by only crawling the URLs that match the regex, which is particularly useful for larger sites, or sites with less intuitive URL structures.

Matching is performed on the URL encoded version of the URL. The page that you start the crawl from must have an outbound link which matches the regex for this feature to work. Obviously if there is not a URL which matches the regex from the start page, the SEO Spider will not crawl anything!

As an example, if you wanted to crawl pages from https://www.screamingfrog.co.uk which have ‘search’ in the URL string, you would simply include the regex .*search.* in the ‘include’ feature.

include feature crawling

This would find the /search-engine-marketing/ and /search-engine-optimisation/ pages as they both have ‘search’ in them.

screaming frog include results
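If you want to sanity-check a pattern before starting a crawl, a quick sketch like the one below can help (this is plain Python regex matching as a rough stand-in, not the SEO Spider’s own matching code, and the third URL is just an example of one the include would skip):

    import re

    # The include pattern from the example above.
    pattern = re.compile(r".*search.*")

    urls = [
        "https://www.screamingfrog.co.uk/search-engine-marketing/",
        "https://www.screamingfrog.co.uk/search-engine-optimisation/",
        "https://www.screamingfrog.co.uk/blog/",  # no 'search' in the URL, so skipped
    ]

    for url in urls:
        matched = bool(pattern.fullmatch(url))
        print(("include" if matched else "skip").ljust(8), url)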

6) Limit the Crawl For Better Sampling

There are various limits available which help control the crawl of the SEO Spider and allow you to get a sample of pages from across the site, without crawling everything. These include –

  • Limit Crawl Total – Limit the total number of pages crawled overall. Browse the site, to get a rough estimate of how many might be required to crawl a broad selection of templates and page types.
  • Limit Crawl Depth – Limit the depth of the crawl to key pages, allowing enough depth to get a sample of all templates.
  • Limit Max URI Length To Crawl – Avoid crawling incorrect relative linking or very deep URLs, by limiting the length of the URL string.
  • Limit Max Folder Depth – Limit the crawl by folder depth, which can be more useful for sites with intuitive structures.
  • Limit Number of Query Strings – Limit crawling of lots of facets and parameters by the number of query string parameters. By setting the query string limit to ‘1’, you allow the SEO Spider to crawl a URL with a single parameter (?colour=red for example), but no more. This can be helpful when various parameters can be appended to URLs in different combinations (see the sketch after this list)!
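Here’s a rough sketch of how a limit of ‘1’ would play out on some hypothetical URLs (this just counts query string parameters with standard Python, it isn’t the SEO Spider’s own logic):

    from urllib.parse import urlparse, parse_qsl

    # Hypothetical URLs to illustrate a query string limit of 1.
    urls = [
        "https://www.example.com/trousers",                      # 0 parameters - crawled
        "https://www.example.com/trousers?colour=red",           # 1 parameter  - crawled
        "https://www.example.com/trousers?colour=red&size=36r",  # 2 parameters - skipped
    ]

    limit = 1
    for url in urls:
        params = parse_qsl(urlparse(url).query)
        label = "crawl" if len(params) <= limit else "skip"
        print(label.ljust(6), url, f"({len(params)} parameter(s))")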

7) Run The SEO Spider In The Cloud With Lots of RAM

If you still need to crawl more, but don’t have a powerful machine with lots of RAM, then consider running the SEO Spider in the cloud and allocating silly amounts of RAM.

There are some really comprehensive guides on setting this up in the cloud which we recommend reading.

8) Remember To Save Regularly

If you’re pushing the SEO Spider to your memory limits, we recommend saving crawl projects regularly. If there is a problem, this means you won’t lose the entire crawl.

You can save the crawl, by clicking ‘Stop’, then ‘File > Save’. Once the crawl has finished saving, simply hit ‘resume’ to continue the crawl again afterwards!

Happy Crawling!

Every website is unique, but the basic principles outlined above should allow you to crawl large websites more efficiently.

If you have any queries regarding our guide to crawling large websites, then please do just get in touch with our support team.
