The Screaming Frog SEO Spider can be downloaded by clicking on the appropriate download button for your operating system and then running the installer. The SEO Spider is available for Windows, Mac and Ubuntu Linux. Version 8.0 is the last version of the SEO Spider to support Windows XP.
The minimum specification is a machine with at least 1GB of RAM. The SEO Spider is capable of crawling millions of URLs with the correct hardware, memory and storage, and can save crawl data in RAM or in a database.
For crawls under 100-200k URLs, a 64-bit OS and 8GB of RAM should be sufficient. However, to be able to crawl millions of URLs, an SSD and 16GB of RAM (or higher) is our recommended hardware.
The Screaming Frog SEO Spider is free to download and use for crawling up to 500 URLs at a time.
In regular crawl mode, the SEO Spider will crawl the subdomain you enter and treat all other subdomains it encounters as external links by default (these appear under the ‘external’ tab). In the licensed version of the software, you can adjust the configuration to crawl all subdomains of a website.
One of the most common uses of the SEO Spider is to find errors on a website, such as broken links, redirects and server errors. Please read our guide on how to find broken links, which explains how to view the source of errors such as 404s, and export the source data in bulk to a spreadsheet.
For better control of your crawl, use the URL structure of your website, the SEO Spider’s configuration options (such as crawling only HTML, rather than images, CSS, JS etc.), the exclude function, the custom robots.txt, the include function, or alternatively change the mode of the SEO Spider and upload a list of URLs to crawl (as discussed further below in this guide).
Crawling A Sub Folder
The SEO Spider tool crawls from the sub folder path forwards by default, so if you wish to crawl a particular sub folder on a site, simply enter the URL with its path. For example, if it’s a blog, it might be https://www.screamingfrog.co.uk/blog/, like our own. By entering this directly into the SEO Spider, it will crawl all URLs contained within the /blog/ sub directory.
You may notice that some URLs which are not within the /blog/ sub folder are crawled as well by default. This is due to the ‘check links outside of start folder’ configuration. This configuration allows the SEO Spider to focus its crawl within the /blog/ directory, but still crawl links that are not within this directory when they are linked to from inside it (it will not crawl any further onwards from those). This is useful as you may wish to find broken links that sit within the /blog/ sub folder, but don’t have /blog/ within the URL structure. To only crawl URLs with /blog/, simply untick this configuration.
Please note that if there isn’t a trailing slash on the end of the sub folder, for example ‘/blog’ instead of ‘/blog/’, the SEO Spider won’t currently recognise it as a sub folder and crawl within it. The same applies if the trailing slash version of a sub folder redirects to the non-trailing slash version.
To crawl this sub folder, you’ll need to use the include feature and input the regex of that sub folder (.*blog.* in this example).
If you have a more complicated set-up, like subdomains and subfolders, you can specify both. For example, enter http://de.example.com/uk/ to crawl the ‘de’ subdomain and the ‘/uk/’ sub folder.
Crawling A List Of URLs
As well as crawling a website by entering a URL and clicking ‘Start’, you can switch to list mode and either paste or upload a list of specific URLs to crawl.
This can be particularly useful when auditing redirects in a site migration, for example. We recommend reading our ‘How To Audit Redirects In A Site Migration’ guide.
Crawling Larger Websites
If you wish to perform a particularly large crawl, we recommend increasing the RAM memory allocation in the SEO Spider first.
If you receive a ‘you are running out of memory for this crawl’ warning, you will need to save the crawl, increase the RAM allocation (and consider switching to database storage mode to save to disk), then open and resume the crawl.
The number of URLs the SEO Spider can crawl depends on the amount of memory available on the machine, how much of it is allocated, and whether you’re crawling in default memory storage or database storage mode.
For really large crawls, have a read of our guide on how to crawl large websites which provides an overview of the available options.
You may wish to consider breaking up crawls into smaller sections and using the configuration to control your crawl. Some options include –
- Crawling by subdomain, or subfolder as discussed above.
- Narrowing the crawl by using the include function, or excluding areas you don’t need to crawl by using the exclude or custom robots.txt features (see the example below).
- Limiting the crawl by total URLs crawled, depth, or number of query string parameters.
These should all help save memory and focus the crawl on the important areas you require. Please see our more in-depth guide on how to crawl large websites.
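As a quick illustration, the exclude function accepts regular expressions matched against the full URL. For example, a rule such as the following would exclude all URLs containing a query string from a crawl –

.*\?.*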
Saving & uploading crawls
In the licensed version of the tool you can save your crawls and open them back into the SEO Spider. The files are saved as a .seospider file type specific to the Screaming Frog SEO Spider.
You can save crawls part way through by stopping the SEO Spider and selecting ‘File > Save’.
To open a crawl, simply double click on the relevant .seospider file, choose ‘File > Open’, or choose one of your recent crawls under ‘File > Open Recent’. You can then resume the crawl if it was saved part way through.
Please note, saving and opening crawls can take a number of minutes or much longer, depending on the size of the crawl and amount of data.
In the licensed version of the tool you can save a default crawl configuration, and save configuration profiles, which can be loaded when required.
To save the current configuration as default choose ‘File > Configuration > Save Current Configuration As Default’.
To save a configuration profile to load in the future, click ‘File > Save As’ and adjust the file name (ideally to something descriptive!).
To load a configuration profile, click ‘File > Load’ and choose your configuration profile, or ‘File > Load Recent’ to select from a recent list.
To reset back to the original SEO Spider default configuration choose ‘File > Configuration > Clear Default Configuration’.
Exporting

The export function in the top window section works on your current view in the top window. Hence, if you’re using a filter and click ‘export’, it will only export the data contained within the filtered view.
There are three main methods to export data –
- Exporting Top Window Data – Simply click the ‘export’ button in the top left hand corner to export data from the top window tabs.
- Exporting Lower Window Data (URL Info, In Links, Out Links, Image Info) – To export any of this data, simply right click on the URL that you wish to export data from in the top window, then click ‘export’ and either ‘URL Info’, ‘In Links’, ‘Out Links’ or ‘Image Info’.
- Bulk Export – This is located under the top level menu and allows bulk exporting of data. You can export all instances of a link found in a crawl via the ‘all in links’ option, or export all in links to URLs with specific status codes such as 2XX, 3XX, 4XX or 5XX responses. For example, selecting the ‘Client Error 4XX In Links’ option will export all in links to all error pages (such as 404 error pages). You can also export all image alt text, all images missing alt text and all anchor text across the site.
Bulk Export Options
- All Inlinks: Links to every page the SEO Spider crawled. This will contain every link to every URI shown under the Response Codes tab in the All filter.
- All OutLinks: All links the SEO Spider saw during crawling. This will contain every link contained in every URI in the Response Codes tab in the All filter.
- All Anchor Text: All HREF links to URIs in the All filter in the Response Codes tab.
- XXX Inlinks: All links to URLs in the respective status code filter and tab, where XXX is the status code range (e.g. 3XX, 4XX or 5XX).
- All Image Alt Text: All links to all Images in the All filter in the Images tab.
- Images Missing Alt Text: All IMG links to images in the Missing Alt Text filter in the Images tab.
Robots.txt

The Screaming Frog SEO Spider is robots.txt compliant. It obeys robots.txt in the same way as Google.
It will check the robots.txt of the subdomain(s) and follow (allow/disallow) directives specifically for the Screaming Frog SEO Spider user-agent if present, otherwise for Googlebot, and then for ALL robots. It follows directives for Googlebot by default. Hence, if certain pages or areas of the site are disallowed for Googlebot, the SEO Spider will not crawl them either. The tool supports URL matching of file values (wildcards * / $), just like Googlebot, too.
You can choose to ignore the robots.txt (it won’t even download it) in the paid (licensed) version of the software by selecting ‘Configuration > robots.txt > Settings > Ignore robots.txt’.
You can also view URLs blocked by robots.txt under the ‘Response Codes’ tab and ‘Blocked by Robots.txt’ filter. This will also show the matched robots.txt line of the disallow against each blocked URL.
Finally, there is also a custom robots.txt configuration, which allows you to download, edit and test a site’s robots.txt under ‘Configuration > robots.txt > Custom’. Please read our user guide about using the Screaming Frog SEO Spider as a robots.txt tester.
A few things to remember here –
- The SEO Spider only follows one set of user-agent directives as per the robots.txt protocol. Hence, priority is given to the Screaming Frog SEO Spider UA if any directives for it exist. If not, the SEO Spider will follow commands for the Googlebot UA, or lastly the ‘ALL’ or global directives.
- To reiterate the above, if you specify directives for the Screaming Frog SEO Spider, or Googlebot then the ALL (or ‘global’) bot commands will be ignored. If you want the global directives to be obeyed, then you will have to include those lines under the specific UA section for the SEO Spider or Googlebot.
- If you have conflicting directives (i.e. an allow and a disallow for the same file path), then a matching allow directive beats a matching disallow if it contains equal or more characters in the command, as in the example below.
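As a hypothetical illustration of conflicting directives –

User-agent: Screaming Frog SEO Spider
Allow: /blog/
Disallow: /blog

Here a URL such as /blog/post matches both rules, but the allow (‘/blog/’, 6 characters) is longer than the disallow (‘/blog’, 5 characters), so the URL would be crawled.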
The SEO Spider obeys robots.txt protocol. Its user agent is ‘Screaming Frog SEO Spider’ so you can include the following in your robots.txt if you wish to block it –
User-agent: Screaming Frog SEO Spider
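Disallow: /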
Alternatively, if you wish to exclude certain areas of your site specifically for the SEO Spider, simply use the usual robots.txt syntax with our user-agent. Please note, there is an option to ‘ignore robots.txt’, which is entirely the responsibility of the user.
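For example, to stop the SEO Spider crawling only a hypothetical /private/ area of a site –

User-agent: Screaming Frog SEO Spider
Disallow: /private/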
Memory & Storage

The Screaming Frog SEO Spider uses a configurable hybrid storage engine, which enables it to crawl millions of URLs. However, it does require memory and storage configuration, as well as the recommended hardware.
By default the SEO Spider will crawl using RAM, rather than saving to disk. This has advantages, but it can’t crawl at scale without lots of RAM allocated.
In standard memory storage mode there isn’t a set number of pages it can crawl; it depends on the complexity of the site and the user’s machine specifications. The SEO Spider sets a maximum memory of 1GB for 32-bit and 2GB for 64-bit machines, which enables it to crawl between 5k-100k URLs of a site.
You can increase the SEO Spider’s memory allocation and crawl into the hundreds of thousands of URLs purely using RAM. A 64-bit machine with 8GB of RAM will generally allow you to crawl a couple of hundred thousand URLs, if the memory allocation is increased.
The SEO Spider can be configured to save crawl data to disk, which enables it to crawl millions of URLs. However, we recommend using this option with a Solid State Drive (SSD), as hard disk drives are significantly slower at writing and reading data. This can be configured by selecting Database Storage mode (under ‘Configuration > System > Storage’).
As a rough guide, an SSD and 8GB of RAM in database storage mode should allow the SEO Spider to crawl approx. 5 million URLs.
High Memory Usage
If you receive a ‘high memory usage’ warning message when performing a crawl, or experience a slow down of the crawl or of the program itself on a large crawl, the SEO Spider has likely reached its current memory allocation, which needs to be increased to crawl more URLs. To do this, save the crawl via the ‘File > Save’ menu. You can then follow the instructions below to increase your memory allocation, before opening the saved crawl and resuming it again.
You’re able to set memory allocation within the application itself by selecting ‘Configuration > System > Memory’.
The SEO Spider will display the physical memory installed on the system, and allow you to configure the allocation quickly. We recommend setting the memory 2GB below your maximum.
Please remember to restart the application for the changes to take effect.
As discussed above, you can switch to database storage mode to increase the number of URLs that can be crawled. We recommend using a Solid State Drive (SSD) for this storage mode, and it can be quickly configured within the application (‘Configuration > System > Storage’).
We recommend this as the default storage for users with an SSD, and for crawling at scale. Database storage mode allows more URLs to be crawled for a given memory setting, with close to RAM-storage crawling speed for set-ups with an SSD.
The default crawl limit is 5 million URLs, but it isn’t a hard limit – the SEO Spider is capable of crawling significantly more (with the right set-up). As an example, a machine with a 500GB SSD and 16GB of RAM should allow you to crawl approximately 10 million URLs.
While not recommended, if you have a fast hard disk drive (HDD), rather than a solid state drive (SSD), then this mode can still allow you to crawl more URLs. However, the writing and reading speed of a hard drive does become the bottleneck in crawling – so both crawl speed, and the interface itself, will be significantly slower.
If you’re working on the machine while crawling, it can also impact machine performance, so the crawl speed might need to be reduced to cope with the load. SSDs are so fast they generally don’t have this problem, which is why ‘database storage’ can be used as the default for both small and large crawls.
Increasing memory on macOS 10.7.2 and earlier
Open ‘Finder’ and navigate to the ‘Applications’ folder, probably listed under ‘Favourites’. Select ‘Screaming Frog SEO Spider’, right click and choose ‘Show Package Contents’.
Then expand the ‘Contents’ folder, select ‘Info.plist’, right click and choose ‘Open With’ and then ‘Other’.
In the resulting prompt menu, choose ‘TextEdit’.
Now find the memory settings section (around line 30) and change the value appropriately. Edit the -Xmx value to change the memory setting – e.g. -Xmx1024M for 1GB.
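The exact key names vary between application and Java versions, but as a rough, hypothetical sketch, the section looks something like this –

<key>JVMOptions</key>
<array>
<string>-Xmx512M</string>
</array>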
Choose ‘File’ then ‘Save’, then ‘TextEdit’ and ‘Quit TextEdit’. Then re-launch the SEO Spider and your new memory settings will now be active.
You can verify your settings have taken effect by following the guide below.
Checking memory allocation
After updating your memory settings, you can verify the changes have taken effect by going to ‘Help > Debug’ and looking at the Memory line.
The SEO Spider uses 512MB by default, so the line will look something like this:
Memory: Used 41MB Free 250 MB Total 292MB Max 455 MB Using 9%
The Max figure will always be a little less than the amount allocated. Allocating 2GB will look like this:
Memory: Used 33MB Free 263 MB Total 296MB Max 1820 MB Using 1%
Please note, the figures shown here aren’t exact, as the VM overhead varies between operating systems and Java versions.
XML sitemap creation
The Screaming Frog SEO Spider allows you to create an XML sitemap or a specific image XML sitemap, located under ‘Sitemaps’ in the top level navigation.
The ‘Create XML Sitemap’ feature allows you to create an XML Sitemap with all HTML 200 response pages discovered in a crawl, as well as PDFs and images. The ‘Create Images Sitemap’ option is a little different to the ‘Create XML Sitemap’ option with ‘images’ included: it includes all images with a 200 response, and ONLY pages that have images on them.
If you have over 49,999 URLs, the SEO Spider will automatically create additional sitemap files and a sitemap index file referencing the sitemap locations. The SEO Spider conforms to the standards outlined in the sitemaps.org protocol.
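As an illustration, a sitemap index file follows the sitemaps.org format below (the file names here are hypothetical) –

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.example.com/sitemap.xml</loc>
</sitemap>
<sitemap>
<loc>https://www.example.com/sitemap_2.xml</loc>
</sitemap>
</sitemapindex>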
Read our detailed tutorial on how to use the SEO Spider as an XML Sitemap Generator, or continue below for a quick overview of each of the XML Sitemap configuration options.
Adjusting Pages To Include
By default, only HTML pages with a ‘200’ response from a crawl will be included in the sitemap, so no 3XX, 4XX or 5XX responses. Pages which are ‘noindex’, ‘canonicalised’ (the canonical URL is different to the URL of the page), paginated (URLs with a rel=“prev”) or PDFs are also not included as standard, but this can be adjusted within the XML Sitemap ‘pages’ configuration.
If you have crawled URLs which you don’t want included in the XML Sitemap export, simply highlight them in the user interface, right click and ‘remove’ before creating the XML sitemap. Alternatively, you can export the ‘internal’ tab to Excel, filter and delete any URLs that are not required, and re-upload the file in list mode before exporting the sitemap. Or simply block them via the exclude feature or robots.txt before a crawl.
It’s optional whether to include the ‘lastmod’ attribute in an XML Sitemap, so this is also optional in the SEO Spider. This configuration allows you to either use the server response, or a custom date for all URLs.
‘Priority’ is an optional attribute to include in an XML Sitemap. You can ‘untick’ the ‘include priority tag’ box, if you don’t want to set the priority of URLs.
It’s optional whether to include the ‘changefreq’ attribute, and the SEO Spider allows you to configure it based on the ‘last modification header’ or ‘level’ (depth) of the URLs. The ‘calculate from last modified header’ option means that if the page has been changed in the last 24 hours, it will be set to ‘daily’; if not, it’s set as ‘monthly’.
It’s entirely optional whether to include images in the XML sitemap. If the ‘include images’ option is ticked, then all images under the ‘Internal’ tab (and under ‘Images’) will be included by default. You can also choose to include images which reside on a CDN and appear under the ‘external’ tab within the UI.
Typically images like logos or social profile icons are not included in an image sitemap, so you can also choose to only include images with a certain number of source attribute references to help exclude these. Often images like logos are linked to sitewide, while images on product pages, for example, might only be linked to once or twice. There is an ‘IMG Inlinks’ column in the ‘images’ tab which shows how many times an image is referenced, to help decide the number of ‘inlinks’ which might be suitable for inclusion.
Reports

There’s a variety of reports which can be accessed via the ‘reports’ top level navigation. These include the following –
Crawl Overview Report
This report provides a summary of the crawl, including data such as the number of URLs encountered, those blocked by robots.txt, the number crawled, the content type, response codes etc. It provides a top level summary of the numbers within each of the tabs and respective filters.
The ‘Total URI Description’ column provides information on what the ‘Total URI’ column number refers to for each individual line, to (try and) avoid any confusion.
Redirect Chains Report
This report maps out chains of redirects, showing the number of hops along the way, and will identify the source, as well as whether there is a loop.
In Spider mode (Mode > Spider) this report will show all redirect chains of size 2 and above, thus showing where redirect chains can be optimised. It also flags redirect loops. If the report is empty, it means you have no loops or redirect chains that can be shortened.
The redirect chain report can also be used in list mode (Mode > List). It will show a line for every URL supplied in the list. By ticking the ‘Always follow redirects‘ option the SEO Spider will continue to crawl redirects in list mode and ignore crawl depth, meaning it will report back upon all hops until the final destination. Please see our guide on auditing redirects in a site migration.
Please note – If you only perform a partial crawl, or some URLs are blocked via robots.txt, you may not receive all response codes for URLs in this report.
Canonical Errors Report
This report highlights errors and issues with canonicals. In particular, it will show any canonicals which have no response, or a 3XX redirect, 4XX or 5XX error (anything other than a 200 ‘OK’ response).
This report also provides data on any URLs which are discovered only via a canonical and are not linked to from the site (in the ‘unlinked’ column when ‘true’).
rel=”next” and rel=”prev” Errors Report
This report highlights errors and issues with rel=”next” and rel=”prev” attributes, which are of course used to indicate paginated content.
The report will show any rel=”next” and rel=”prev” URLs which have no response, are blocked by robots.txt, or return a 3XX redirect, 4XX or 5XX error (anything other than a 200 ‘OK’ response).
This report also provides data on any URLs which are discovered only via a rel=”next” or rel=”prev” attribute and are not linked to from the site (in the ‘unlinked’ column when ‘true’).
Hreflang Reports

There are 4 hreflang reports which allow data to be exported in bulk, which include the following –
- Errors – This report shows any hreflang attributes which are not a 200 response (no response, blocked by robots.txt, 3XX, 4XX or 5XX responses) or are unlinked on the site.
- Missing Confirmation Links – This report shows the page missing a confirmation link, and which page is not confirming.
- Inconsistent Language Confirmation Links – This report shows confirmation pages which use different language codes to the same page.
- Non Canonical Confirmation Links – This report shows confirmation links which point to non-canonical URLs.
Insecure Content Report
The insecure content report will show any secure (HTTPS) URLs which have insecure elements on them, such as internal HTTP links, images, JS, CSS, SWF, or external images on a CDN, social profiles etc. When you’re migrating a website to secure (HTTPS) from non-secure (HTTP), it can be difficult to pick up all insecure elements, and this can lead to warnings in a browser.
SERP Summary Report
This report allows you to quickly export URLs, page titles and meta descriptions with their respective character lengths and pixel widths.
This report can also be used as a template to re-upload back into the SEO Spider in ‘SERP’ mode.
Orphan Pages Report
The orphan pages report (previously called the ‘GA & GSC Not Matched’ report in version 8.0 and below) provides a list of URLs collected from the Google Analytics API and Google Search Console (Search Analytics API) that were not matched against URLs discovered within the crawl. Hence, this report will be blank unless you have connected to Google Analytics or Search Console and collected data during a crawl.
The ‘source’ column shows exactly which API the URL was discovered via, but not matched against a URL in the crawl. These include –
- GA – The URL was discovered via the Google Analytics API.
- GSC – The URL was discovered in Google Search Console, by the Search Analytics API.
- GA & GSC – The URL was discovered in both Google Analytics & Google Search Console.
This report can include any URLs returned by Google Analytics for the query you select in your Google Analytics configuration. Hence, this can include logged in areas, or shopping cart URLs, so often the most useful data for SEOs is returned by querying the landing page path dimension and ‘organic traffic’ segment. This can then help identify –
- Orphan Pages – These are pages that are not linked to internally on the website, but do exist. These might just be old pages, those missed in an old site migration or pages just found externally (via external links, or referring sites). This report allows you to browse through the list and see which are relevant and potentially upload via list mode.
- Errors – The report can include 404 errors, which sometimes include the referring website within the URL as well (you will need the ‘all traffic’ segment for these). This can be useful for chasing up websites to correct external links, or just 301 redirecting the URL which errors, to the correct page! This report can also include URLs which might be canonicalised or blocked by robots.txt, but are actually still indexed and delivering some traffic.
- GA or GSC URL Matching Problems – If data isn’t matching against URLs in a crawl, you can check to see what URLs are being returned via the GA or GSC API. This might highlight any issues with the particular Google Analytics view, such as filters on URLs (e.g. ‘extended URL’ hacks). For the SEO Spider to return data against URLs in the crawl, the URLs need to match up. So changing to a ‘raw’ GA view, which hasn’t been touched in any way, might help.
Crawl Path Report
This report isn’t under the ‘reports’ drop down in the top level menu; it’s available via right-click on a URL in the top window pane, under the ‘export’ option.
This report shows you the shortest path the SEO Spider crawled to discover the URL, which can be really useful for deep pages, rather than viewing the ‘inlinks’ of lots of URLs to discover the original source URL (for example, for infinite URLs caused by a calendar).
The crawl path report should be read from bottom to top. The first URL at the bottom of the ‘source’ column is the very first URL crawled (with a ‘0’ level). The ‘destination’ shows which URLs were crawled next, and these make up the following ‘source’ URLs for the next level (1) and so on, upwards.
The final ‘destination’ URL at the very top of the report will be the URL of the crawl path report.
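As a hypothetical illustration, a crawl path report for https://www.example.com/deep-page/ might read (from bottom to top) –
- Level 1: source https://www.example.com/category/ – destination https://www.example.com/deep-page/
- Level 0: source https://www.example.com/ – destination https://www.example.com/category/
The start URL sits at the bottom as a level ‘0’ source, and the report URL is the final destination at the top.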
Command line & scheduling
You can use the command line to start a crawl. Please see our post How To Schedule A Crawl By Command Line In The SEO Spider for more information on scheduling a crawl.
Supplying no arguments starts the application as normal. Supplying a single argument of a file path, tries to load that file in as a saved crawl. Supplying the following:
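--crawl http://www.example.com/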
starts the Spider and immediately triggers the crawl of the supplied domain. This switches the Spider to crawl mode if it’s not the last used mode, and uses your default configuration for the crawl.
Note: If your last used mode was not crawl, “Ignore robots.txt” and “Limit Search Depth” will be overwritten.
Windows

Open a command prompt (Start button, then search programs and files for ‘Windows Command Processor’).
Move into the SEO Spider directory:
cd "C:\Program Files\Screaming Frog SEO Spider"
To start normally:
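ScreamingFrogSEOSpiderLauncher.exe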
To open a crawl file (Only available to licensed users):
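ScreamingFrogSEOSpiderLauncher.exe C:\tmp\crawl.seospider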
To auto start a crawl:
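ScreamingFrogSEOSpiderLauncher.exe --crawl http://www.example.com/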
macOS

Open a terminal, found in the Utilities folder in the Applications folder, or directly by using Spotlight and typing ‘terminal’.
To start normally:
open "/Applications/Screaming Frog SEO Spider.app"
To open a saved crawl file:
open "/Applications/Screaming Frog SEO Spider.app" /tmp/crawl.seospider
To auto start a crawl:
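open "/Applications/Screaming Frog SEO Spider.app" --args --crawl http://www.example.com/
(The --args flag passes the remaining arguments through to the application.)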
Ubuntu

The following commands are available from the command line:
To start normally:
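screamingfrogseospider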
To open a saved crawl file:
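screamingfrogseospider /tmp/crawl.seospider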
To auto start a crawl:
screamingfrogseospider --crawl http://www.example.com/
Search Function

The search box in the top right of the interface allows you to search all visible columns for a given regular expression. By default, the SEO Spider will only search the URL column, unless other columns are selected.
After typing the search, hit enter and only rows containing matching cells will be displayed. If you would like to make this search case insensitive, you can do so using the following regular expression:
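(?i)keyword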
This would match keyword, Keyword and KEYWORD.