By default, the Screaming Frog SEO Spider crawls only the subdomain you enter and treats any other subdomains it encounters as external links. The tool spiders from the directory path onwards, so if you wish to crawl a particular subfolder (or subdirectory) on your site, simply enter the URI with its file path. For example, if it was your blog, it might be – http://www.example.com/blog/. This will crawl all URIs contained within the /blog/ subdirectory. If you have a more complicated setup, such as subdomains and subfolders, you can specify both. For example – http://de.example.com/uk/ to spider the ‘de’ subdomain and the /uk/ subfolder.
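The scoping rule above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the spider's own code; the function name `in_crawl_scope` is hypothetical and the rule is simplified to "same hostname, path under the start directory".

```python
from urllib.parse import urlsplit


def in_crawl_scope(start_url: str, candidate: str) -> bool:
    """Return True if candidate is on the start subdomain and under the start directory."""
    start, cand = urlsplit(start_url), urlsplit(candidate)
    # Any other subdomain (or site) is treated as external.
    if cand.hostname != start.hostname:
        return False
    # Crawl from the directory of the start URL onwards.
    start_dir = start.path.rsplit("/", 1)[0] + "/"
    return (cand.path or "/").startswith(start_dir)
```

With http://www.example.com/blog/ as the start URL, http://www.example.com/blog/post-1 is in scope, while http://www.example.com/shop/ and http://de.example.com/blog/ are not.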
For better control of your crawl, use the URI structure of your website, the SEO Spider’s configuration options (such as crawling only HTML, or images, CSS, JS etc.), the exclude function, the include function, or alternatively change the mode of the SEO Spider and upload a list of URIs to crawl. If you wish to perform a particularly large crawl, we would advise increasing the memory (RAM) allocation of the spider.
In the premium version of the tool you can save your crawls and open them again in the spider. The files are saved as a .seospider file type, which is specific to the Screaming Frog SEO Spider.
You can save projects part way through a crawl by stopping the spider and selecting ‘file’ then ‘save’. To re-open a file, simply choose ‘file’ then ‘open’, or choose one of your recent crawls under ‘file’ and ‘open recent’. You can then resume the crawl if it was saved part way through. Please note, saving and re-opening crawls can take a number of minutes depending on the size of the crawl.
The export function in the top window section works with your current field of view in the top window. Hence, if you are using a filter and click ‘export’ it will only export the data contained within the filtered option.
There are three main methods to export data -
The Screaming Frog SEO Spider is robots.txt compliant. It obeys robots.txt in the same way as Google.
It will check the robots.txt of the (sub)domain and follow the allow/disallow directives written specifically for the Screaming Frog SEO Spider user-agent; if there are none, it follows those for Googlebot, and failing that, those for all robots. By default, it currently follows any directives for Googlebot. Hence, if certain pages or areas of the site are disallowed for Googlebot, the spider will not crawl them either. The tool supports URL matching of path values (the wildcards * and $), just like Googlebot.
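To make the order of precedence and the wildcard matching concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than the spider's implementation: the helper names are hypothetical, the robots.txt is assumed to be already parsed into a mapping of user-agent to Disallow patterns, and Allow rules are left out for brevity.

```python
import re

# Order of preference described in the text above.
AGENT_PRIORITY = ["screaming frog seo spider", "googlebot", "*"]


def pick_group(groups: dict[str, list[str]]) -> list[str]:
    """Return the Disallow patterns of the most specific matching user-agent group."""
    lowered = {agent.lower(): rules for agent, rules in groups.items()}
    for agent in AGENT_PRIORITY:
        if agent in lowered:
            return lowered[agent]
    return []


def pattern_to_regex(pattern: str) -> re.Pattern:
    """'*' matches any run of characters; a trailing '$' anchors the end of the URL."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_disallowed(path: str, groups: dict[str, list[str]]) -> bool:
    """Check a URL path against the selected group (empty patterns block nothing)."""
    return any(pattern_to_regex(p).match(path) for p in pick_group(groups) if p)
```

For example, given `{"*": ["/"], "Googlebot": ["/private/", "/*.pdf$"]}`, the Googlebot group is chosen, so /private/x and /files/a.pdf are disallowed while /public/ is not.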
A couple of things to remember here -
The spider obeys robots.txt protocol. Its user agent is ‘Screaming Frog SEO Spider’ so you can include the following in your robots.txt if you wish to block the spider -
User-agent: Screaming Frog SEO Spider
Disallow: /
Or alternatively, if you wish to exclude certain areas of your site specifically for the SEO Spider, simply use the usual robots.txt syntax with our user-agent. Please note – there is an option to ‘ignore’ robots.txt, the use of which is entirely the responsibility of the user.
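If you want to sanity-check such a rule before crawling, one option is Python's standard `urllib.robotparser`. The sketch below is an assumption-laden stand-in, not the spider's own parser: the /private/ path and example.com URLs are made up for illustration, and the standard library parser does not understand the * and $ wildcards.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks only /private/ for the SEO Spider.
ROBOTS_TXT = """\
User-agent: Screaming Frog SEO Spider
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "Screaming Frog SEO Spider"
blocked = not parser.can_fetch(AGENT, "http://www.example.com/private/report.html")
allowed = parser.can_fetch(AGENT, "http://www.example.com/blog/")
```

Here `blocked` and `allowed` should both be true: the section for our user-agent applies, and only the /private/ area is excluded.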
As standard, the Screaming Frog SEO Spider allocates 512MB of RAM. If you are crawling particularly large sites, you may need to increase the memory allocation of the spider.
There is not a set number of pages the SEO Spider can crawl; it is dependent on the complexity of the site and a number of other factors. Generally speaking, with the standard memory allocation of 512MB the spider can crawl between 10K and 100K URIs of a site. If you have received the following ‘high memory usage’ warning message when performing a crawl –
Or if you are experiencing a slowdown of the crawl, or of the program itself, on a large crawl, this will be due to memory allocation.
The warning is essentially telling you that the SEO Spider has reached the current memory allocation and that it needs to be increased to crawl more URLs. To do this, you should save the crawl via the ‘file’ and ‘save’ menu. You can then follow the instructions below to increase your memory, before opening the saved crawl and resuming again.
Windows 32 & 64-bit: First of all, if you have a 64-bit machine, ensure you download and install the 64-bit version of Java or you will not be able to allocate any more than on a 32-bit machine. Look in the folder the spider is installed in (default is C:\Program Files\Screaming Frog SEO Spider). There should be 4 files: two application files (install and uninstall), a .jar file and the file we need to edit, which is a configuration file called ‘ScreamingFrogSEOSpider.l4j’ (an .ini file). If you open this file in Notepad you will notice it has a line reading ‘-Xmx512M’, which reflects the total memory assigned to the SEO Spider (‘512MB’).
The default number is 512M, so to double the memory you can simply replace ‘512’ with ‘1024’, or to allocate 6GB simply replace it with ‘6144’, for example (please leave the -Xmx and M text, so 1,024 would appear as -Xmx1024M in the file).
Please note, this is RAM (rather than hard disk space), so if you set it higher than you actually have on your machine, the spider won’t start. If this happens, edit the file again to a more realistic memory allocation based on your machine’s available memory.
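If you would rather script the change than edit by hand, the replacement is a single substitution. The Python below is a minimal sketch that works on the file contents as a string; the function name `set_max_heap` is hypothetical, and reading and writing the file (and any administrator rights needed) are left to you.

```python
import re


def set_max_heap(config_text: str, megabytes: int) -> str:
    """Replace the number in the first -Xmx<number>M entry, keeping '-Xmx' and 'M'."""
    new_text, count = re.subn(r"-Xmx\d+M", f"-Xmx{megabytes}M", config_text, count=1)
    if count == 0:
        raise ValueError("no -Xmx<number>M setting found")
    return new_text


# 6GB expressed in megabytes, as in the example above.
six_gb = 6 * 1024
doubled = set_max_heap("-Xmx512M", 1024)
```

In this sketch `six_gb` is 6144 and `doubled` is the string -Xmx1024M, matching the figures given in the text.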
Essentially, 32-bit Windows machines are limited to 4GB of RAM. This generally means the maximum memory allocation will be between 1,024MB and 1,500MB, as this is all that will actually be available.
For 64-bit machines, you will be able to allocate significantly more, obviously dependent on how much memory your machine has. The SEO spider is built for 32 and 64-bit machines, but please remember to install the 64-bit version of Java to be able to allocate as much memory as your system will allow on a 64-bit machine.
If Windows will not allow you to edit the file directly (you will probably need administrator rights), try copying the file to your desktop, editing it and then pasting it back into the folder, replacing the original file.
To check the memory increase has taken effect, restart (or just start) the SEO Spider, go to ‘help’, then ‘debug’, and look at the ‘max’ figure. Note: this will always be a bit less than you’ve allocated; that’s normal and down to JVM management.
Mac – Open ‘/Applications/Screaming Frog SEO Spider.app/Contents/Info.plist’ in a text editor, then locate (around line 30) and change the following string -
Increase this number to the allocation you wish to assign, to double it for example -
If you can’t see the Info.plist file, go to the ‘Applications’ folder using the Finder. Right click on ‘ScreamingFrogSEOSpider’ and select ‘Show package contents’; the Info.plist file is in the ‘Contents’ folder. Right click on it and open it with a text editor.
Ubuntu – To increase the memory to crawl larger websites, change the number in the .screamingfrogseospider file in the user’s home directory. The file is created when the SEO Spider runs, if it does not already exist.
The file contents by default is: -Xmx512M
Simply amend the number as per the examples above.
If you are not able to increase your memory allocation, we recommend crawling large sites in sections. You can use the configuration menu to crawl only HTML (rather than images, CSS or JS) or to exclude certain sections of the site. Alternatively, if you have a nicely structured IA, you can crawl by directory (/holidays/, /blog/ etc.) or use our include regex crawling feature. The tool was not built to crawl entire sites with hundreds of thousands of pages to pick up every single issue, as it currently stores crawl data in RAM rather than in a database.
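As a rough illustration of crawling in sections with include and exclude patterns, the Python sketch below filters URLs with regular expressions. It assumes a pattern must match the whole URL, which may differ from how the tool applies its patterns; `should_crawl` and the example.com URLs are hypothetical.

```python
import re


def should_crawl(url: str, include=(), exclude=()) -> bool:
    """Keep a URL if it matches an include pattern (when any are given) and no exclude pattern."""
    if include and not any(re.fullmatch(p, url) for p in include):
        return False
    return not any(re.fullmatch(p, url) for p in exclude)


# One section per crawl: the blog only, skipping its tag pages.
BLOG_ONLY = [r"http://www\.example\.com/blog/.*"]
NO_TAGS = [r".*/tag/.*"]
```

Running one crawl per directory pattern (for instance /blog/, then /holidays/) keeps each crawl within the memory available.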
By default, the Screaming Frog SEO Spider does not accept cookies, like search engine bots. However, under the ‘advanced’ tab of the spider configuration, there is now an option to allow cookies.
The Screaming Frog SEO Spider allows you to create an XML sitemap using the ‘XML sitemap’ option located under ‘Advanced export’ in the top level navigation. Currently this feature is only for HTML pages, so it does not include images, videos etc. (which we will be introducing at a later date). We conform to the standards outlined in the sitemaps.org protocol.
Only HTML pages with a 200 response from a crawl will be included in the sitemap, so no 3XX, 4XX or 5XX pages. Hence, pages with ‘noindex’ and ‘canonical’ elements on them may be included in a sitemap if they have been crawled (and included in the ‘internal’ tab). A quick way to remove URLs such as these is simply to highlight them in the user interface, right click and ‘remove’ before creating the XML sitemap. Alternatively, you can export the ‘internal’ tab to Excel, filter and delete any URLs that are not required, and re-upload the file in list mode before exporting the sitemap. Or simply block them via the exclude feature or robots.txt before a crawl.
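The inclusion rule can be expressed as a simple filter. The Python below is a sketch only: the `CrawledPage` record and its fields are hypothetical stand-ins for a row of the ‘internal’ tab, and the optional `drop_noindex` step mirrors the manual clean-up described above rather than anything the tool does by itself.

```python
from dataclasses import dataclass


@dataclass
class CrawledPage:
    """Hypothetical record for one crawled URL."""
    url: str
    status: int
    content_type: str
    noindex: bool = False


def sitemap_candidates(pages, drop_noindex=False):
    """Keep HTML pages that returned 200; optionally drop 'noindex' pages as well."""
    kept = []
    for page in pages:
        if page.status != 200 or not page.content_type.startswith("text/html"):
            continue
        if drop_noindex and page.noindex:
            continue
        kept.append(page.url)
    return kept
```

Note that with `drop_noindex` left at its default, a crawled ‘noindex’ page with a 200 response is still kept, which is exactly the case the paragraph above warns about.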
If you have over 49,999 URLs, the SEO Spider will automatically create additional sitemap files and a sitemap index file referencing the sitemap locations. Please note – the SEO Spider sets the ‘lastmod’ date as the current date (on creation), the ‘changefreq’ as daily for all URIs, and the ‘priority’ as 1.0 for the start page of the crawl (generally the homepage) and 0.5 everywhere else. Please amend all of these to your own requirements. If you are unsure about ‘changefreq’ and ‘priority’, they can simply be removed, just leaving the ‘loc’ and ‘lastmod’.
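The defaults described above can be reproduced with Python's standard XML library, which may help if you want to regenerate or adjust a sitemap yourself. This is a sketch under assumptions, not the tool's export code: `build_sitemaps` and `build_index` are hypothetical names, and the per-file limit of 49,999 is taken from the text.

```python
from datetime import date
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 49_999  # per-file limit, as stated in the text above


def build_sitemaps(urls, start_url, today=None):
    """Return one <urlset> element per sitemap file, using the defaults described."""
    lastmod = (today or date.today()).isoformat()
    files = []
    for i in range(0, len(urls), MAX_URLS):
        urlset = ET.Element("urlset", xmlns=NS)
        for url in urls[i:i + MAX_URLS]:
            entry = ET.SubElement(urlset, "url")
            ET.SubElement(entry, "loc").text = url
            ET.SubElement(entry, "lastmod").text = lastmod
            ET.SubElement(entry, "changefreq").text = "daily"
            # 1.0 for the start page of the crawl, 0.5 everywhere else.
            ET.SubElement(entry, "priority").text = "1.0" if url == start_url else "0.5"
        files.append(urlset)
    return files


def build_index(sitemap_locations):
    """Return a <sitemapindex> element referencing each sitemap file location."""
    index = ET.Element("sitemapindex", xmlns=NS)
    for loc in sitemap_locations:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = loc
    return index
```

Each element can be written out with `ET.ElementTree(element).write(path, encoding="utf-8", xml_declaration=True)`; to drop ‘changefreq’ and ‘priority’, simply remove the two corresponding lines.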
Redirect Chains Report: This report maps out chains of redirects and the number of hops along the way, and identifies the source as well as whether there is a loop. The redirect chain report can also be used in list mode alongside the ‘Always follow redirects’ option, which is very useful in site migrations. When you tick this box, the SEO Spider will continue to crawl redirects in list mode and ignore crawl depth, meaning it will report back upon all hops until the final destination. Please note – if you only perform a partial crawl, or some URLs are blocked via robots.txt, you may not receive all response codes for URLs in this report.
Crawl Overview Report: This report provides a summary of the crawl, including data such as the number of URLs encountered, those blocked by robots.txt, the number crawled, the content type, response codes etc. The ‘total URI description’ explains what the number in the ‘Total URI’ column means for each individual line, to (try and) avoid any confusion.
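The idea behind a redirect chain, its hop count and loop detection can be sketched as follows. This Python is purely illustrative: `follow_redirects` is a hypothetical helper, the redirects are assumed to be already known as a mapping from URL to target, and the `max_hops` cap is an assumption of the sketch.

```python
def follow_redirects(start, redirects, max_hops=20):
    """Return (chain, is_loop); redirects maps each redirecting URL to its target."""
    chain = [start]
    seen = {start}
    while chain[-1] in redirects and len(chain) - 1 < max_hops:
        target = redirects[chain[-1]]
        chain.append(target)
        if target in seen:
            # The chain has come back to a URL it already visited.
            return chain, True
        seen.add(target)
    return chain, False


chain, is_loop = follow_redirects("/a", {"/a": "/b", "/b": "/c"})
hops = len(chain) - 1
```

Here the chain runs /a, /b, /c with two hops and no loop; a mapping such as /x to /y and /y back to /x would be reported as a loop.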
Crawl Path Report: This report is not under the ‘reports’ drop down in the top level menu, it’s available upon right-click of a URL in the top window pane and within the ‘export’ option. For example –
This report shows you the path the SEO Spider crawled to discover the URL, which can be really useful for deep pages, rather than viewing the ‘in links’ of lots of URLs to discover the original source URL (for example, for infinite URLs). The crawl path report should be read from bottom to top. The first URL at the bottom of the ‘source’ column is the very first URL crawled (with a ‘0’ level). The ‘destination’ column shows which URLs were crawled next, and these make up the ‘source’ URLs for the next level (1) and so on, upwards. The final ‘destination’ URL at the very top of the report will be the URL the crawl path report was run for.
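To show how such a report can be built and why it reads bottom to top, here is a small Python sketch. It assumes a breadth-first crawl that remembers which page first linked to each URL; that discovery order, the `crawl_path_report` name and the link graph are assumptions for illustration, not a description of the spider's internals.

```python
from collections import deque


def crawl_path_report(links, start, target):
    """Return (level, source, destination) rows, final hop first, start URL last."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url == target:
            break
        for nxt in links.get(url, []):
            if nxt not in parent:
                parent[nxt] = url  # remember the first page that linked here
                queue.append(nxt)
    if target not in parent:
        return []
    hops = []
    node = target
    while parent[node] is not None:
        hops.append((parent[node], node))
        node = parent[node]
    # hops is listed target-first; levels are counted from the start URL (level 0).
    return [(len(hops) - 1 - i, src, dst) for i, (src, dst) in enumerate(hops)]
```

For a site where / links to /a, /a links to /c and /c links to /deep, the rows for /deep come out as level 2 (/c to /deep) at the top, then level 1 (/a to /c), then level 0 (/ to /a) at the bottom.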