How To Crawl Large Websites Using The SEO Spider

Crawling websites and collecting data is a memory intensive process, and the more you crawl, the more memory is required to store and process the data. The Screaming Frog SEO Spider uses a configurable hybrid engine, that requires some adjustments to allow for large scale crawling.

By default the SEO Spider uses RAM, rather than your hard disk to store and process data. This provides amazing benefits such as speed and flexibility, but it does also have disadvantages, most notably, crawling at scale.

The SEO Spider can also be configured to save crawl data to disk, by selecting ‘Database Storage’ mode (under ‘Configuration > System > Storage’), which enables it to crawl at truly unprecedented scale, while retaining the same, familiar real-time reporting and usability.

8m page titles scrolling

TL;DR Version

If you’d rather not read the full guide, these are the two main requirements for crawling very large websites.

1) Use a machine with an internal SSD, and switch to database storage mode (‘Configuration > System > Storage’).

2) Allocate RAM (‘Configuration > System > Memory’). 8gb allocated will allow approx. 5 million URLs to be crawled.

The guide below provides a more comprehensive overview of the differences between memory and database storage, the ideal set-up for crawling large websites and how to crawl intelligently to avoid wasting both time and resource unnecessarily.

What Are The Differences Between Memory & Database Storage?

Fundamentally both storage modes can still provide virtually the same crawling experience, allowing for real-time reporting, filtering and adjusting of the crawl. However, there are some key differences, and the ideal storage mode will depend on the crawl scenario and machine specifications.

Memory Storage

Memory storage mode allows for super fast and flexible crawling for virtually all set-ups. However, as machines have less RAM than hard disk space, it means the SEO Spider is generally better suited for crawling websites under 500k URLs in memory storage mode.

Users are able to crawl more than this with the right set-up, depending on how memory intensive the website being crawled is. As a very rough guide, a 64-bit machine with 8gb of RAM will generally allow you to crawl a couple of hundred thousand URLs.

As well as being a better option for smaller websites, memory storage mode is also recommended for machines without an SSD, or where there isn’t much disk space.

Database Storage

We recommend this as the default storage for users with an SSD, and for crawling at scale. Database storage mode allows for more URLs to be crawled for a given memory setting, with close to RAM storage crawling speed for set-ups with a solid state drive (SSD).

The default crawl limit is 5 million URLs, but it isn’t a hard limit – the SEO Spider is capable of crawling significantly more (with the right set-up). As an example, a machine with a 500gb SSD and 16gb of RAM should allow you to crawl approximately 10 million URLs.

While not recommended, if you have a fast hard disk drive (HDD), rather than a solid state drive (SSD), then this mode can still allow you to crawl more URLs. However, the writing and reading speed of a hard drive does become the bottleneck in crawling – so both crawl speed, and the interface itself, will be significantly slower.

If you’re working on the machine while crawling, it can also impact machine performance, so the crawl speed may need to be reduced to cope with the load. SSDs are so fast they generally don’t have this problem, and this is why ‘database storage’ can be used as the default for both small and large crawls.

Do You Really Need To Crawl The Whole Site?

This is the question we always recommend asking. Do you need to crawl every URL, to get the data you need?

Advanced SEOs know that often it’s just not required. Generally websites are templated, and a sample crawl of page types from across various sections will be enough to make informed decisions across the wider site.

So, why crawl 5m URLs, when 50k is enough? With a few simple adjustments, you can avoid wasting resources and time on these (more on adjusting the crawl shortly).

It’s worth remembering that crawling large sites takes up not just resources, but also a lot of time (and cost for some solutions). A 1 million page website at an average crawl rate of 5 URLs per second will take over two days to crawl. You could crawl faster, but most websites and servers don’t want to be crawled faster than that kind of speed.
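
To put that into numbers: 1,000,000 URLs ÷ 5 URLs per second = 200,000 seconds, which is roughly 55 hours, or over two days of continuous crawling.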

When considering scale, it’s not just unique pages or data collected that needs to be considered, but also the internal linking of the website. The SEO Spider records every single inlink or outlink (and resource), so a 100k page website with 100 site wide links on every page actually means recording more like 10 million links.
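
In numbers again: 100,000 pages × 100 site wide links per page = 10,000,000 link records to store, before even counting unique pages, resources and other data.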

However, with the above said, there are times when a complete crawl is essential. You may need to crawl a large website in its entirety, or perhaps the website is at an enterprise level with 50m pages, and you need to crawl more to even get an accurate sample. In these scenarios, we recommend the following approach to crawling larger websites.

1) Switch To Database Storage

We recommend using an SSD and switching to database storage mode. If you don’t have an SSD, we highly recommend investing in one. It’s the single biggest upgrade you can make to a machine for a comparatively low investment, and it allows you to crawl at huge scale without compromising performance.

Users can select to save to disk by choosing ‘database storage mode’, within the interface (via ‘Configuration > System > Storage’).

database storage mode

If you don’t have an SSD (buy one now!), then you can ignore this step, and simply follow the rest of the recommendations in this guide. It’s worth noting, you can use an external SSD with USB 3.0 if your system supports UASP mode.

2) Increase Memory Allocation

The SEO Spider as standard allocates just 1gb of RAM for 32-bit machines and 2gb of RAM for 64-bit. In memory storage mode, this should allow you to crawl between 10-150k URLs of a website. In database storage mode, this should allow you to crawl between 1-2 million URLs approximately.

The amount of RAM allocated will impact how many URLs you can crawl in both memory and database storage modes, but far more significantly in memory storage mode.

For RAM storage mode, we usually recommend a minimum of 8gb of RAM to crawl larger websites, with a couple of hundred thousand pages. But the more RAM you have, the better!

For database storage, 8gb of RAM will allow you to crawl up to 5 million URLs, 16gb up to 10 million, and 32gb above 20 million URLs. These are all approximations, as it depends on the site.

When you reach the limit of memory allocation, you will receive the following warning.

SEO Spider out of memory warning

This is warning you that the SEO Spider has reached the current memory allocation and it needs to be increased to crawl more URLs, or it will become unstable. To increase memory, first of all you should save the crawl via the ‘File > Save’ menu.

Then you can adjust the memory under ‘Configuration > System > Memory’. We generally recommend allocating 2gb less than your total RAM available.

In-app memory allocation

The SEO Spider will only use the memory when required; this just means you have the maximum available to you if and when you need it. You can then open the saved crawl, and resume the crawl again.

The more memory you are able to allocate, the more you’ll be able to crawl. So if you don’t have a machine with much RAM available, we recommend using a more powerful machine, or upgrading the amount of RAM.

3) Adjust What To Crawl In The Configuration

The more data that’s collected and the more that’s crawled, the more memory intensive it will be. So you can consider options for reducing memory consumption for a ‘lighter’ crawl.

Deselecting the following options under ‘Configuration > Spider’ will help save memory –

  1. Check Images.
  2. Check CSS.
  3. Check JavaScript.
  4. Check SWF.
  5. Check External Links.

Please note, if you’re crawling in JavaScript rendering mode, you’ll likely need most of these options enabled, otherwise it will impact the render. Please see our ‘How To Crawl JavaScript Websites’ guide.

You can also deselect the following crawl options under ‘Configuration > Spider’ to help save memory –

  1. Crawl Canonicals – This option only impacts crawling; canonicals will still be extracted when deselected.
  2. Crawl Next/Prev – This option only impacts crawling; next/prev URLs will still be extracted when deselected.
  3. Extract Hreflang – This option means URLs in hreflang will not be extracted at all.
  4. Crawl Hreflang – This option means URLs in hreflang will not be crawled.

There are also other options that will use memory if utilised, so consider avoiding the following features –

  1. Custom Search.
  2. Custom Extraction.
  3. Google Analytics Integration.
  4. Google Search Console Integration.
  5. Link Metrics Integration (Majestic, Ahrefs and Moz).

This means less data, less crawling and lower memory consumption.

4) Exclude Unnecessary URLs

Use the exclude feature to avoid crawling unnecessary URLs. These might include entire sections, faceted navigation filters, particular URL parameters, or infinite URLs with repeating directories etc.

The exclude feature allows you to exclude URLs from a crawl completely, by supplying a list of regular expressions (regex). A URL that matches an exclude is not crawled at all (it’s not just ‘hidden’ in the interface). It’s also worth bearing in mind that URLs which do not match the exclude, but can only be reached from an excluded page, will also not be crawled. So use the exclude with care.
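
As a rough illustration (using hypothetical URL structures, rather than any particular site), an exclude list might look something like this –

https://www.example.com/basket/.*
.*\?price=.*
.*/page/[0-9]{3,}/.*

The first pattern excludes an entire section, the second excludes any URL carrying a ‘price’ parameter, and the third excludes deeply paginated URLs. Always test patterns like these against a sample of crawled URLs before relying on them.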

We recommend performing a crawl and ordering the URLs in the ‘Internal’ tab alphabetically, then analysing them for patterns and areas for potential exclusion. Generally, by scrolling through the list in real time as the crawl runs, you can put together a list of URLs for exclusion.

For example, ecommerce sites often have faceted navigations which allow users to filter and sort, and which can result in lots of URLs. Sometimes the same facets are crawlable in different orders, resulting in a huge, or effectively endless, number of URLs.

Let’s take a real-life scenario, like John Lewis. If you crawl the site with standard settings, then due to its numerous facets you can easily crawl filtered pages, such as the example below.

Selecting these facets generates URLs such as –

https://www.johnlewis.com/browse/men/mens-trousers/adidas/allsaints/gant-rugger/hymn/kin-by-john-lewis/selected-femme/homme/size=36r/_/N-ebiZ1z13yvxZ1z0g0g6Z1z04nruZ1z0s0laZ1z13u9cZ1z01kl3Z1z0swk1

This URL has multiple brands, a trouser size and delivery option selected. There are also facets for colour, trouser fit and more! The number of different combinations that could be selected is virtually endless, and these should be considered for exclusion.

By ordering URLs in the ‘Internal’ tab alphabetically, it’s easy to spot URL patterns like these for potential exclusion. We can also see that URLs from the facets on John Lewis are set to ‘noindex’ anyway. Hence, we can simply exclude them from being crawled.

noindex john lewis size pages

Once you have a sample of URLs and have identified the issue, it’s generally unnecessary to then crawl every facet and combination. They may also already be canonicalised, disallowed or noindexed, so you know they have already been ‘fixed’, and they can simply be excluded.
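
For instance, based on the facet URL structure shown above, a single exclude regex such as .*/_/N-.* (or .*size=.* for the size facet alone) would prevent these faceted combinations from being crawled. These are illustrative patterns, so always test them against a sample of URLs first.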

5) Crawl In Sections (Subdomain or Subfolders)

If the website is very large, you can consider crawling it in sections. By default, the SEO Spider will crawl just the subdomain entered, and all other subdomains encountered will be treated as external (and appear under the ‘external’ tab). You can choose to crawl all subdomains, but obviously this will take up more memory.

The SEO Spider can also be configured to crawl a subfolder, by simply entering the subfolder URI with file path and ensuring ‘check links outside of start folder’ and ‘crawl outside of start folder’ are deselected under ‘Configuration > Spider’. For example, to crawl our blog, you’d simply enter https://www.screamingfrog.co.uk/blog/ and hit start.

Screaming Frog blog crawling subfolder

Please note that if there isn’t a trailing slash on the end of the subfolder, for example ‘/blog’ instead of ‘/blog/’, the SEO Spider won’t currently recognise it as a subfolder and crawl within it. The same applies if the trailing slash version of a subfolder redirects to a non-trailing slash version.

To crawl such a subfolder, you’ll need to use the include feature and input the regex of that subfolder (.*blog.* in this example).

6) Narrow The Crawl, By Using The Include

You can use the include feature to control which URL path the SEO Spider will crawl via regex. It narrows the default search by only crawling the URLs that match the regex, which is particularly useful for larger sites, or sites with less intuitive URL structures.

Matching is performed on the URL encoded version of the URL. The page that you start the crawl from must have an outbound link which matches the regex for this feature to work. Obviously if there is not a URL which matches the regex from the start page, the SEO Spider will not crawl anything!

As an example, if you wanted to crawl pages from https://www.screamingfrog.co.uk which have ‘search’ in the URL string, you would simply include the regex .*search.* in the ‘include’ feature.

include feature crawling

This would find the /search-engine-marketing/ and /search-engine-optimisation/ pages as they both have ‘search’ in them.

screaming frog include results

7) Limit the Crawl For Better Sampling

There are various limits available which help control the crawl of the SEO Spider and allow you to get a sample of pages from across the site, without crawling everything. These include –

  • Limit Crawl Total – Limit the total number of pages crawled overall. Browse the site, to get a rough estimate of how many might be required to crawl a broad selection of templates and page types.
  • Limit Crawl Depth – Limit the depth of the crawl to key pages, allowing enough depth to get a sample of all templates.
  • Limit Max URI Length To Crawl – Avoid crawling incorrect relative linking or very deep URLs, by limiting by length of the URL string.
  • Limit Max Folder Depth – Limit the crawl by folder depth, which can be more useful for sites with intuitive structures.
  • Limit Number of Query Strings – Limit crawling of lots of facets and parameters by the number of query strings. By setting the query string limit to ‘1’, you allow the SEO Spider to crawl a URL with a single parameter (?colour=blue for example), but no more. This can be helpful when various parameters can be appended to URLs in different combinations, as shown in the example below!
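
As an illustration, using a hypothetical URL: with the query string limit set to ‘1’, https://www.example.com/dresses?colour=blue would be crawled, but https://www.example.com/dresses?colour=blue&size=10, which has two parameters, would not.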

8) Buy An External SSD With USB 3.0(+)

If you don’t have an internal SSD and you’d like to crawl large websites using database storage mode, then an external SSD can help.

It’s important to ensure your machine has a USB 3.0 port and that your system supports UASP mode. Most newer systems support this automatically if they have USB 3.0 hardware. When you connect the external SSD, ensure you connect it to the USB 3.0 port, otherwise reading and writing will be very slow.

USB 3.0 ports generally have a blue insert (as recommended in the specification), though not always, and you will typically need to connect a blue-ended USB cable to the blue USB 3.0 port. After that, you need to switch to database storage mode, and then select a database location on the external SSD (the ‘D’ drive in the example below).

database storage mode with an external SSD

Simple!

9) Run The SEO Spider In The Cloud With an SSD & Lots of RAM

If you still need to crawl more, but don’t have a powerful machine with an SSD or lots of RAM, then consider running the SEO Spider in the cloud, ensuring it has an SSD, switching to ‘database storage’ mode, and allocating silly amounts of RAM.

There are some really comprehensive guides we recommend –

10) Remember To Save Regularly

If you’re pushing the SEO Spider to your memory limits, we recommend saving crawl projects regularly. If there’s a problem, this means you won’t lose the whole crawl.

You can save the crawl, by clicking ‘Stop’, then ‘File > Save’. Once the crawl has finished saving, simply hit ‘resume’ to continue the crawl again afterwards!

Happy Crawling!

Every website is unique, but the basic principles outlined above should allow you to crawl large websites more efficiently.

If you have any queries regarding our guide to crawling large websites, then please do just get in touch with our support team.
