SEO Spider General


Installation

The Screaming Frog SEO Spider can be downloaded by clicking on the appropriate download button for your operating system and then running the installer.

The minimum specification is a machine able to run Java 7 with at least 512MB of RAM. The number of URLs you can crawl is based upon how much memory you can allocate to the tool.

For more details on how to install Java 7, view our Java 7 installation guide.

Crawling

The Screaming Frog SEO Spider is free to download and use for crawling up to 500 URLs at a time. For £99 per annum you can purchase a licence which removes the 500 URL crawl limit and opens up the spider’s configuration options.

In regular crawl mode, the SEO Spider will crawl the subdomain you enter and treat all other subdomains it encounters as external links by default. In the licenced version of the software, you can adjust the configuration to crawl all subdomains of a website.

If you just wish to use the SEO Spider as a broken link checker for a website, then read our guide on how to find broken links using the SEO Spider, which explains how to view 404 errors and export the source data in bulk.

For better control of your crawl, use the URI structure of your website, the SEO Spider’s configuration options (such as crawling only HTML, rather than images, CSS, JS etc.), the exclude function, the include function, or alternatively change the mode of the SEO Spider and upload a list of URLs to crawl (as discussed further below in this guide).

Crawling A Sub Folder

The SEO Spider tool crawls from the sub folder path forwards by default, so if you wish to crawl a particular sub folder on your site, simply enter the URI with the file path. For example, if it’s a blog, it might be – http://www.screamingfrog.co.uk/blog/, like our own. By entering this directly into the SEO Spider, it will crawl all URI contained within the /blog/ sub directory.

You may notice some URLs which are not within the /blog/ sub folder are crawled as well by default. This will be due to the ‘check links outside of start folder‘ configuration. This configuration allows the SEO Spider to crawl within the /blog/ directory, but still crawl links that are not within this directory when they are linked to from within it. However, it will not crawl any further onwards from those pages. This is useful as you may wish to find broken links that sit within the /blog/ sub folder, but don’t have /blog/ within the URL structure. To only crawl URLs with /blog/, simply untick this configuration.

Please note that if there isn’t a trailing slash on the end of the sub folder, for example ‘/blog’ instead of ‘/blog/’, the SEO Spider won’t currently recognise it as a sub folder and crawl within it. The same applies if the trailing slash version of a sub folder redirects to a non trailing slash version.

To crawl such a sub folder, you’ll need to use the include feature and input the regex of that sub folder (.*blog.* in this example). If you have a more complicated set-up, like subdomains and sub folders, you can specify both. For example, enter http://de.example.com/uk/ to spider the ‘de.’ subdomain and ‘/uk/’ sub folder.
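The include feature takes one regular expression per line. As a sketch (both paths illustrative), the first pattern below would restrict the crawl to URLs containing ‘blog’, and the second to the ‘/uk/’ sub folder on the ‘de.’ subdomain:

```
.*blog.*
http://de\.example\.com/uk/.*
```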

Crawling A List Of URLs

As well as crawling a website by entering a URL and clicking ‘Start’, you can switch to list mode and either paste or upload a list of specific URLs to crawl.

This can be particularly useful for site migrations when auditing redirects, for example. We recommend reading our ‘How To Audit Redirects In A Site Migration‘ guide.

Crawling Larger Websites

If you wish to perform a particularly large crawl, we would advise increasing the RAM allocation in the SEO Spider.

If you receive a ‘you are running out of memory for this crawl’ warning, then you will need to save the project, increase the RAM allocation, then re-open the saved crawl and resume it. The number of URLs the SEO Spider can crawl comes down to the amount of memory available on the machine and how much of it is allocated.

Saving & Uploading Crawls

In the licenced version of the tool you can save your crawls and re-upload them back into the spider. The files are saved as a .seospider file type specific to the Screaming Frog SEO Spider.

You can save projects part way through a crawl by stopping the spider and selecting ‘file’ then ‘save’. To re-upload a file, simply choose ‘file’ then ‘open’ or choose one of your recent crawls under ‘file’ and ‘open recent’. You can then resume the crawl if saved part way through. Please note, saving and re-uploading crawls can take a number of minutes depending on the size of the crawl.

Exporting

The export function in the top window section works with your current field of view in the top window. Hence, if you are using a filter and click ‘export’ it will only export the data contained within the filtered option.

There are three main methods to export data –

You can also view our video guide about exporting from the SEO Spider.

Robots.txt

The Screaming Frog SEO Spider is robots.txt compliant. It obeys robots.txt in the same way as Google.

It will check the robots.txt of the (sub)domain and follow (allow/disallow) directives specifically for the ‘Screaming Frog SEO Spider’ user-agent; failing that, directives for Googlebot, and then directives for ALL robots. It currently follows any directives for Googlebot by default. Hence, if certain pages or areas of the site are disallowed for Googlebot, the SEO Spider will not crawl them either. The tool supports URL matching of file values (wildcards * / $), just like Googlebot.
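For example, with the following illustrative robots.txt, which contains no ‘Screaming Frog SEO Spider’ group, the SEO Spider would fall back to the Googlebot directives and would not crawl /private/ (the paths here are made up):

```
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /admin/
```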

You can choose to ignore the robots.txt (it won’t even download it) in the licenced version of the software by selecting the option under Configuration -> Spider -> Ignore robots.txt.

A couple of things to remember here  –

User-Agent

The SEO Spider obeys the robots.txt protocol. Its user agent is ‘Screaming Frog SEO Spider’, so you can include the following in your robots.txt if you wish to block it –

User-agent: Screaming Frog SEO Spider

Disallow: /

Or alternatively, if you wish to exclude certain areas of your site specifically for the SEO Spider, simply use the usual robots.txt syntax with our user-agent. Please note, there is an option to ‘ignore’ robots.txt, which is entirely the responsibility of the user.

Memory

The Screaming Frog SEO Spider allocates 512MB of RAM as standard. If you are crawling particularly large sites, you will need to increase the memory allocation of the SEO Spider.

There is not a set number of URLs the SEO Spider can crawl at the standard memory allocation; it is dependent on the complexity of the site and a number of other factors. Generally speaking, with the standard allocation of 512MB the SEO Spider can crawl between 10K and 100K URI of a site. If you have received the following ‘high memory usage’ warning message when performing a crawl –

[Screenshot: high memory usage warning message]

Or if you are experiencing a slow-down in the crawl, or of the program itself, on a large crawl, this will be due to reaching the memory allocation.

This is warning you that the SEO Spider has reached its current memory allocation, which needs to be increased to crawl more URLs. To do this, save the crawl via the ‘file’ and ‘save’ menu. You can then follow the instructions below to increase your memory allocation, before opening the saved crawl and resuming it again.

Increasing Memory On Windows 32 & 64-bit

First of all, if you have a 64-bit machine, ensure you download and install the 64-bit version of Java, or you will not be able to allocate any more than a 32-bit machine and you’ll receive a ‘Could not create the Java virtual machine’ error message on start-up.

To update the memory, simply navigate to the folder the SEO Spider is installed in (the default is C:\Program Files\Screaming Frog SEO Spider). There should be four files: two application files (install & uninstall), a .jar file, and then the file we need to edit, a configuration file called ‘ScreamingFrogSEOSpider.l4j.ini’. If you open this file in Notepad you will notice it contains the line ‘-Xmx512M’, which reflects the total memory assigned to the SEO Spider (512MB).

The default number is 512M, so to double the memory simply replace ‘512’ with ‘1024’, or to allocate 6GB replace it with ‘6144’, for example (please leave the -Xmx and M text, so 1,024 would appear as -Xmx1024M in the file). Here is a screenshot of around 9GB –
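For instance, allocating 6GB, the edited ‘ScreamingFrogSEOSpider.l4j.ini’ file would contain the line:

```ini
-Xmx6144M
```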

[Screenshot: memory allocation in the .ini file]

Please note, this is RAM (rather than hard disk space). As explained above, if you receive a ‘Could not create the Java virtual machine’ error message after increasing memory –

[Screenshot: ‘Could not create the Java virtual machine’ error]

Then it will be due to either –

1) You’re using the 32-bit version of Java. You need to manually choose and install the 64-bit version of Java. If you already have the 64-bit version of Java, then uninstall all versions of Java and reinstall the 64-bit version again manually.

2) You have allocated more memory than you actually have available. If this happens, edit the file again to a more realistic memory allocation based on your machine’s available memory.

You can read about memory limits for Windows here, but essentially 32-bit Windows machines are limited to 4GB of RAM. This generally means the maximum memory allocation will be between 1,024MB and 1,500MB, as this is all that will actually be available.

For 64-bit machines, you will be able to allocate significantly more, obviously dependent on how much memory your machine has. The SEO Spider is built for 32 and 64-bit machines, but please remember to install the 64-bit version of Java to be able to allocate as much memory as your system will allow on a 64-bit machine.

If Windows will not allow you to edit the file directly (you will probably need administrator rights), try copying the file to your desktop, editing it, and then pasting it back into the folder, replacing the original file.

To check the memory increase has taken effect, restart (or just start) the SEO Spider, go to ‘help’ then ‘debug’, and look at the ‘max’ figure. Note: this will always be a bit less than you’ve allocated; that’s normal and down to JVM management.

[Screenshot: memory shown in the debug window]

Increasing Memory On Mac OS X

There are two methods to amend the SEO Spider memory, dependent on which version of the SEO Spider you are using. If you are using the latest version (2.55 or higher), then simply –

Open a ‘Terminal’ (found in the ‘Utilities’ folder in the ‘Applications’ folder, or directly using spotlight and typing: ‘terminal’) and type:

defaults write uk.co.screamingfrog.seo.spider Memory 1g

This allocates 1GB of memory to the SEO Spider. To allocate 8GB:

defaults write uk.co.screamingfrog.seo.spider Memory 8g

You can also specify a memory figure in megabytes, using the m suffix:

defaults write uk.co.screamingfrog.seo.spider Memory 2048m

These memory settings will be remembered across upgrades, so you’ll only need to do this once. You can view the currently assigned memory value by issuing the read variant of the defaults command:

defaults read uk.co.screamingfrog.seo.spider Memory

Which will return something like:

8g

By default no value is set, so you will get a message like this:

The domain/default pair of (uk.co.screamingfrog.seo.spider, Memory) does not exist

and the SEO Spider will use 512MB.

If you’re using a Mac with OS X below version 10.7.3 (32-bit Macs), then you will have to use version 2.40 or lower of the SEO Spider, and you will need to update your memory using the following method –

Open ‘Finder’ and navigate to the ‘Applications’ folder, probably listed under ‘Favourites’, as below. Select ‘Screaming Frog SEO Spider’, right click and choose ‘Show Package Contents’.

[Screenshot: ‘Show Package Contents’ in OS X Finder]

Then expand the ‘Contents’ folder, select ‘Info.plist’, right click and choose ‘Open With’ and then ‘Other’.

[Screenshot: ‘Open With’ then ‘Other’]

In the resulting prompt menu, choose ‘TextEdit’.

[Screenshot: choosing TextEdit]

Now find the ‘VMOptions’ section (around line 30) and edit the value below it to change the memory settings – e.g. -Xmx1024M for 1GB:

VMOptions
-Xmx512M

Choose ‘File’ then ‘Save’, then ‘TextEdit’ and ‘Quit TextEdit’. Then re-launch the SEO Spider and your new memory settings will now be active.

Increasing Memory On Ubuntu

To increase the memory for crawling larger websites, change the number in the ~/.screamingfrogseospider file. If this file does not exist, it will be created when the SEO Spider runs. The default content of the file is: -Xmx512M.

Note: To avoid any typos, please copy and paste the examples below.

To amend this file, open a terminal and type the following to allocate 1GB of memory.

echo "-Xmx1g" > ~/.screamingfrogseospider

2GB can be allocated as follows:

echo "-Xmx2g" > ~/.screamingfrogseospider

You can also allocate memory in megabytes rather than gigabytes. Here we’re allocating 1.5 gigabytes:

echo "-Xmx1500M" > ~/.screamingfrogseospider

Please note: the max memory figure shown in Help->Debug will always be less than that allocated, as some memory is reserved for the Java Virtual Machine.

If you are not able to increase your memory allocation, we recommend crawling large sites in sections. You can use the configuration menu to crawl just HTML (rather than images, CSS or JS), or exclude certain sections of the site. Alternatively, if you have a nicely structured IA, you can crawl by directory (/holidays/, /blog/ etc.) or use our include regex crawling feature. The tool was not built to crawl entire sites with hundreds of thousands of pages and pick up every single issue, as it currently stores crawl data in memory (RAM) rather than a database.

Cookies

By default the Screaming Frog SEO Spider does not accept cookies, just like search engine bots. However, under Configuration->Spider in the “Advanced” tab there is an option to accept cookies. This is useful for crawling sites that require the client to accept cookies in order to be crawled.

XML Sitemap Creation

The Screaming Frog SEO Spider allows you to create an XML sitemap or a specific image XML sitemap, located under ‘Sitemaps’ in the top level navigation.

[Screenshot: XML sitemap menu]

The ‘Create XML Sitemap’ feature allows you to create an XML Sitemap with all HTML pages that returned a 200 response in a crawl, as well as PDFs and images. The ‘Create Images Sitemap’ option is a little different to the ‘Create XML Sitemap’ option with ‘images’ included: it includes all images with a 200 response, and ONLY pages that have images on them.

If you have over 49,999 URLs the SEO Spider will automatically create additional sitemap files and a sitemap index file referencing the sitemap locations. The SEO Spider conforms to the standards outlined in the sitemaps.org protocol.
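As a reference point, a sitemap index file generated for a large crawl follows the sitemaps.org format; the file names below are illustrative:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap_1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap_2.xml</loc>
  </sitemap>
</sitemapindex>
```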

Read our detailed tutorial on how to use the SEO Spider as an XML Sitemap Generator, or continue below for a quick overview of each of the XML Sitemap configuration options.

Adjusting Pages To Include

By default, only HTML pages with a ‘200’ response from a crawl will be included in the sitemap, so no 3XX, 4XX or 5XX responses. Pages which are ‘noindex’, ‘canonicalised’ (the canonical URL is different to the URL of the page), paginated (URLs with a rel="prev") or PDFs are also not included as standard, but this can be adjusted within the XML Sitemap ‘pages’ configuration.

[Screenshot: XML sitemap options]

If you have crawled URLs which you don’t want included in the XML Sitemap export, then simply highlight them in the user interface, right click and ‘remove’ before creating the XML sitemap. Alternatively, you can export the ‘internal’ tab to Excel, filter and delete any URLs that are not required, and re-upload the file in list mode before exporting the sitemap. Or simply block them via the exclude feature or robots.txt before a crawl.

Last Modified

It’s optional whether to include the ‘lastmod’ attribute in an XML Sitemap, so this is also optional in the SEO Spider. This configuration allows you to use either the server response, or a custom date for all URLs.

[Screenshot: XML sitemap ‘lastmod’ options]

Priority

It’s optional whether to include the ‘priority’ attribute and the SEO Spider allows you to configure these based upon ‘level’ (the depth) of the URLs. You can view the ‘level’ of URLs under the ‘level’ column in the ‘Internal’ tab.

As shown in the screenshot below, by default the homepage (or start page of the crawl) is set to the highest priority of ‘1’, descending by 0.1 in priority by each level of depth down to 0.5 for level 5+.

[Screenshot: XML sitemap priority options]
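To illustrate, under the default settings described above a level 1 URL would be written into the sitemap with a priority of 0.9; the URL, date and change frequency below are purely illustrative:

```xml
<url>
  <loc>http://www.example.com/blog/</loc>
  <lastmod>2014-01-01</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.9</priority>
</url>
```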

Change Frequency

It’s optional whether to include the ‘changefreq’ attribute, and the SEO Spider allows you to configure it based on the ‘last modification header’ or the ‘level’ (depth) of the URLs. The ‘calculate from last modified header’ option means that if the page has been changed in the last 24 hours it will be set to ‘daily’; if not, it’s set to ‘monthly’.

[Screenshot: XML sitemap ‘changefreq’ options]
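The ‘calculate from last modified header’ rule described above can be sketched in a few lines of shell; the Last-Modified value here is illustrative rather than fetched from a real server, and this is our reading of the rule, not the tool’s actual code:

```shell
# Pages modified within the last 24 hours get 'daily', all others 'monthly'.
last_modified="Tue, 01 Jan 2013 00:00:00 GMT"   # illustrative header value

now=$(date +%s)
modified=$(date -d "$last_modified" +%s)        # GNU date, as on Ubuntu
age=$(( now - modified ))

if [ "$age" -lt 86400 ]; then
  changefreq="daily"
else
  changefreq="monthly"
fi

echo "$changefreq"
```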

Images

It’s entirely optional whether to include images in the XML sitemap. If the ‘include images’ option is ticked, then all images under the ‘Internal’ tab (and under ‘Images’) will be included by default. As shown in the screenshot below, you can also choose to include images which reside on a CDN and appear under the ‘external’ tab within the UI.

[Screenshot: XML sitemap image options]

Typically images like logos or social profile icons are not included in an image sitemap, so you can also choose to only include images with a certain number of source attribute references, to help exclude these. Often images like logos are linked to sitewide, while images on product pages, for example, might only be linked to once or twice. There is an ‘IMG Inlinks’ column in the ‘images’ tab which shows how many times an image is referenced, to help decide the number of ‘inlinks’ which might be suitable for inclusion.

Reports

There’s a variety of reports which can be accessed via the ‘reports’ top level navigation. These include the following –

Crawl Overview Report

This report provides a summary of the crawl, including data such as the number of URLs encountered, those blocked by robots.txt, the number crawled, the content types, response codes etc. The ‘total URI description’ provides information on what the ‘Total URI’ column number means for each individual line, to (try and) avoid any confusion.

Redirect Chains Report

This report maps out chains of redirects and the number of hops along the way, and identifies the source, as well as whether there is a loop. The redirect chain report can also be used in list mode, alongside the ‘Always follow redirects‘ option, which is very useful in site migrations. When you tick this box, the SEO Spider will continue to crawl redirects in list mode and ignore crawl depth, meaning it will report on all hops until the final destination. Please note – if you only perform a partial crawl, or some URLs are blocked via robots.txt, you may not receive all response codes for URLs in this report.

Canonical Errors Report

This report highlights errors and issues with canonicals. In particular, it will show any canonicals which have no response, a 3XX redirect, or a 4XX or 5XX error (anything other than a 200 ‘OK’ response). It also provides data on any URLs which are discovered only via a canonical and are not linked to from the site (shown as ‘true’ in the ‘unlinked’ column).

Insecure Content Report

The insecure content report will show any secure (HTTPS) URLs which have insecure elements on them, such as internal HTTP links, images, JS, CSS, SWF or external images on a CDN, social profiles etc. When you’re migrating a website to secure (HTTPS) from non secure (HTTP), it can be difficult to pick up all insecure elements and this can lead to warnings in a browser –

[Screenshot: insecure content browser warning]

Here’s a quick example of how a report might look (with insecure images in this case) –

[Screenshot: insecure content report]

This report does not currently consider canonicals, so if an HTTPS URL has an HTTP canonical, it will not be included in this report. However, these can be seen as usual under the ‘canonicalised’ filter in the ‘Directives’ tab.

SERP Summary Report

This report allows you to quickly export URLs, page titles and meta descriptions with their respective character lengths and pixel widths. It can also be used as a template to re-upload back into the SEO Spider in ‘SERP’ mode.

Crawl Path Report

This report isn’t under the ‘reports’ drop down in the top level menu; it’s available upon right-clicking a URL in the top window pane, within the ‘export’ option. For example –
[Screenshot: crawl path report export option]

This report shows you the path the SEO Spider crawled to discover the URL which can be really useful for deep pages, rather than viewing ‘inlinks’ of lots of URLs to discover the original source URL (for example, for infinite URLs caused by a calendar).

The crawl path report should be read from bottom to top. The first URL at the bottom of the ‘source’ column is the very first URL crawled (with a ‘0’ level). The ‘destination’ shows which URLs were crawled next, and these make up the following ‘source’ URLs for the next level (1) and so on, upwards. The final ‘destination’ URL at the very top of the report will be the URL of the crawl path report!

Command Line & Scheduling

You can use the command line to start a crawl. Please see our post How To Schedule A Crawl By Command Line In The SEO Spider for more information on scheduling a crawl.

Supplying no arguments starts the application as normal. Supplying a single argument of a file path, tries to load that file in as a saved crawl. Supplying the following:

--crawl http://www.example.com/

starts the SEO Spider and immediately triggers the crawl of the supplied domain. This switches the spider to crawl mode if it’s not the last used mode, and uses your default configuration for the crawl.

Note: If your last used mode was not crawl, “Ignore robots.txt” and “Limit Search Depth” will be overwritten.

Windows

Open a command prompt (Start button, then search programs and files for ‘Windows Command Processor’)

Move into the SEO Spider directory:

cd "C:\Program Files\Screaming Frog SEO Spider"

To start normally:

ScreamingFrogSEOSpider.exe

To open a crawl file (Only available to licensed users):

ScreamingFrogSEOSpider.exe C:\tmp\crawl.seospider

To auto start a crawl:

ScreamingFrogSEOSpider.exe --crawl http://www.example.com/

[Screenshot: Windows command line]

MAC OS X

Open a terminal, found in the Utilities folder in the Applications folder, or directly using spotlight and typing: ‘terminal’.

To start normally:

open "/Applications/Screaming Frog SEO Spider.app" 

To open a saved crawl file:

open "/Applications/Screaming Frog SEO Spider.app" /tmp/crawl.seospider

To auto start a crawl:

open "/Applications/Screaming Frog SEO Spider.app" --args --crawl http://www.example.com/

[Screenshot: Mac command line]

Linux

The following commands are available from the command line:

To start normally:

screamingfrogseospider

To open a saved crawl file:

screamingfrogseospider /tmp/crawl.seospider

To auto start a crawl:

screamingfrogseospider --crawl http://www.example.com/


Why Purchase A Licence?

  • The 500 URI crawl limit is removed
  • You can access ALL the configuration options
  • You can save and re-upload crawls
  • You can search for anything in the source code of a website with the custom source code search feature
  • You get support for any technical issues with the software
Buy a Screaming Frog SEO Spider Licence

Contact Us

Screaming Frog Ltd
Market Chambers,
33 Market Place,
Henley-on-Thames,
Oxfordshire,
RG9 2AA

Tel: +44 (0)1491 415070
Fax: +44 (0)1491 410208
info@screamingfrog.co.uk
