SEO Spider General

Installation

The Screaming Frog SEO Spider can be downloaded by clicking on the appropriate download button for your operating system and then running the installer.

The SEO Spider is available for Windows, Mac and Ubuntu Linux. Version 8.0 is the last version of the SEO Spider to support Windows XP.

The minimum specification is a machine with at least 1gb of RAM. The SEO Spider is capable of crawling millions of URLs with the correct hardware, memory and storage. It is able to save crawl data in RAM, or to disk in a database.

For crawls under 100-200k URLs a 64-bit OS and 8gb of RAM should be sufficient. However, to be able to crawl millions of URLs, an SSD and 16gb of RAM (or higher) is our recommended hardware.

Screaming Frog SEO Spider

Installation on macOS

These instructions were written using macOS 10.13 High Sierra and are valid for all recent versions of macOS.

Download the latest version of the SEO Spider. The downloaded file will be named ScreamingFrogSEOSpider-10.3.dmg or similar depending on version. The file will most likely download to your Downloads directory which can easily be accessed via Finder.

Installation

The downloaded file is a disk image containing the Screaming Frog SEO Spider application. Go to your Downloads folder in Finder, double click on the downloaded file and you will see the following screen.

Click on the Screaming Frog SEO Spider application icon on the left and drag it onto the Applications folder on the right. This copies the Screaming Frog SEO Spider application to your Applications folder, which is where most Applications live on macOS.

Now close this window by clicking the x in the top left. Go to Finder, look under the Devices section on the left, locate ScreamingFrogSEOSpider and click the eject icon next to it.

Running the SEO Spider

The SEO Spider can be run one of two ways.

GUI

Go to your Applications folder using Finder, locate the Screaming Frog SEO Spider icon and double click it to launch.

Command Line

If you would like to run via the command line, please see our User Guide.

Troubleshooting

If you get a message like this when opening the .dmg, please reboot your Mac and try again.

Installation on Ubuntu

These instructions are for installing on Ubuntu 18.04.1.

Download the latest version of the SEO Spider. The downloaded file will be called something like screamingfrogseospider_10.2_all.deb, and will most likely be in the Downloads folder in your home directory.

Installation

You can install the SEO Spider in one of two ways.

GUI

– Double click on the .deb file.
– Choose “Install” and enter your password.
– The SEO Spider requires ttf-mscorefonts-installer to be run, so accept the licence for this when it pops up.
– Wait for the installation to complete.

Command Line

Open up a terminal and enter the following command.

 sudo apt-get install ~/Downloads/screamingfrogseospider_10.4_all.deb

You will need to enter your password, then enter Y when asked if you want to continue, and accept the ttf-mscorefonts-installer EULA.

Troubleshooting

  • E: Unable to locate package screamingfrogseospider_10.4_all.deb

    Please make sure you are entering an absolute path to the .deb to install as per the example.

  •  Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/u/somepackage.deb 404 Not Found [IP: 91.189.88.149 80]

    Please run the following and try again.

    sudo apt-get update

Running the SEO Spider

Irrespective of how the SEO Spider was installed you can run it one of two ways.

GUI

Click the Apps icon in the bottom left hand corner, type “seo spider” and click on the applications icon when it appears.

Command Line

Enter the following command in a terminal:

screamingfrogseospider

For more command line options see our User Guide.

Crawling

The Screaming Frog SEO Spider is free to download and use for crawling up to 500 URLs at a time.

For £149 a year you can buy a licence, which removes the 500 URL crawl limit.

A licence also provides access to all the configuration options, saving & opening crawls, JavaScript rendering, custom source code search, custom extraction, Google Analytics integration, Google Search Console integration, external link metrics integration, AMP crawling & validation and lots more!

Crawling A Website (Subdomain)

In regular crawl mode, the SEO Spider will crawl the subdomain you enter and treat all other subdomains it encounters as external links by default (these appear under the ‘external’ tab).

For example, by entering https://www.screamingfrog.co.uk in the ‘Enter URL to spider’ box at the top and clicking ‘Start’, the Screaming Frog www. subdomain will be crawled.

crawl a website

In the licenced version of the software, you can adjust the configuration to choose to crawl all subdomains of a website, if there are multiple. If you start a crawl from the root (e.g. https://screamingfrog.co.uk), the SEO Spider will by default crawl all subdomains as well.

One of the most common uses of the SEO Spider is to find errors on a website, such as broken links, redirects and server errors. Please read our guide on how to find broken links, which explains how to view the source of errors such as 404s, and export the source data in bulk to a spread sheet.

For better control of your crawl, use the URL structure of your website by crawling a subfolder, use the SEO Spider's configuration options such as crawling only HTML (rather than images, CSS, JS etc.), the exclude function, the custom robots.txt or the include function, or alternatively change the mode of the SEO Spider and upload a list of URLs to crawl.

Crawling A Subfolder

The SEO Spider tool crawls from the subfolder path forwards by default. Simply enter the full subfolder URL to crawl it.

For example, if it’s a blog, it might be – https://www.screamingfrog.co.uk/blog/. By entering this directly into the SEO Spider, it will crawl all URLs contained within the /blog/ subfolder.

crawling a subfolder

You may notice some URLs which are not within the /blog/ subfolder are crawled as well by default. This will be due to the ‘check links outside of start folder‘ configuration.

This configuration allows the SEO Spider to focus its crawl within the /blog/ directory, but still crawl links that are not within this directory when they are linked to from inside it. However, it will not crawl any further onwards. This is useful as you may wish to find broken links that sit within the /blog/ subfolder, but don’t have /blog/ within the URL structure. To only crawl URLs with /blog/, simply untick this configuration.

If there isn’t a trailing slash on the end of the subfolder, for example ‘/blog’ instead of ‘/blog/’, the SEO Spider won’t recognise it as a subfolder and won’t restrict the crawl to it. The same applies if the trailing slash version of a subfolder redirects to the non-trailing slash version.

To crawl this subfolder, you’ll need to use the include feature and input the regex of that subfolder (.*blog.* in this example).

If you have a more complicated set-up like subdomains and subfolders you can specify both. For example – http://de.example.com/uk/ to spider the ‘de.’ subdomain and ‘/uk/’ subfolder.

Crawling A List Of URLs

As well as crawling a website by entering a URL and clicking ‘Start’, you can switch to list mode and either paste or upload a list of specific URLs to crawl.

list mode

This can be particularly useful for site migrations when auditing URLs and redirects for example. We recommend reading our guide on ‘How To Audit Redirects In A Site Migration‘ for the best approach.

If you wish to export data in list mode in the same order it was uploaded, then use the ‘Export’ button which appears next to the ‘upload’ and ‘start’ buttons at the top of the user interface.

The data in the export will be in the same order and include all of the exact URLs in the original upload, including duplicates or any fix-ups performed.

Crawling Larger Websites

If you wish to perform a particularly large crawl, we recommend increasing the RAM memory allocation in the SEO Spider first.

Allocate Memory

If you receive the ‘you are running out of memory for this crawl’ warning, you will need to save the crawl, increase the RAM allocation (and consider switching to database storage mode to save to disk), then open the saved crawl and resume it.

The number of URLs the SEO Spider can crawl is down to the amount of memory available on the machine and whether it’s allocated, and whether you’re crawling in default memory storage, or database storage mode.

database storage mode

For really large crawls, have a read of our guide on how to crawl large websites which provides an overview of the available options.

You may wish to consider breaking up crawls into smaller sections and using the configuration to control your crawl – for example, by crawling a subfolder, crawling only HTML, or using the include and exclude functions.

These should all help save memory and focus the crawl on the important areas you require. Please see our more in-depth guide on how to crawl large websites.

Saving & uploading crawls

In the licensed version of the tool you can save your crawls and open them back into the SEO Spider. The files are saved as a .seospider file type specific to the Screaming Frog SEO Spider.

You can save crawls part way through by stopping the SEO Spider and selecting ‘File > Save’.

saving crawls

To open a crawl, simply double click on the relevant .seospider file, choose ‘File > Open’ or choose one of your recent crawls under ‘File > Open Recent’. You can then resume the crawl if it was saved part way through.

Please note, saving and opening crawls can take a number of minutes or much longer, depending on the size of the crawl and amount of data.

Configuration

In the licensed version of the tool you can save a default crawl configuration, and save configuration profiles, which can be loaded when required.

To save the current configuration as default choose ‘File > Configuration > Save Current Configuration As Default’.

To save a configuration profile to be able to load in the future, click ‘File > Configuration > Save As’ and adjust the file name (ideally to something descriptive!).

configuration profiles

To load a configuration profile, click ‘File > Configuration > Load’ and choose your configuration profile, or ‘File > Configuration > Load Recent’ to select from a recent list.

To reset back to the original SEO Spider default configuration choose ‘File > Configuration > Clear Default Configuration’.

Scheduling

You’re able to schedule crawls to run automatically within the SEO Spider, as a one-off, or at chosen intervals. This feature can be found under ‘File > Scheduling’ within the app.

Screaming Frog Scheduler

You’re able to pre-select the mode (spider or list), saved configuration, as well as APIs (Google Analytics, Search Console, Majestic, Ahrefs, Moz) to pull in any data for the scheduled crawl.

scheduling start options

You can also automatically save the crawl file and export any of the tabs, bulk exports, reports or XML Sitemaps to a chosen location.

scheduling export options

A new instance of the SEO Spider is started for a scheduled crawl. So if there is an overlap of crawls, multiple instances of the SEO Spider will run at the same time, rather than there being a delay until the previous crawl has completed. Hence, we recommend considering your system resources and timing of crawls appropriately.

Please note – The SEO Spider will run in headless mode (meaning without an interface) when scheduled to export data. This is to avoid any user interaction or the application starting in front of you and options being clicked, which would be a little strange.

This scheduling is within the user interface. If you’d prefer to use the command line to operate the SEO Spider, please see our command line interface guide.

Exporting

You can export all data from a crawl, including bulk exporting inlink and outlink data. There are three main methods to export data outlined below.

Exporting Tabs & Filters (Top Window Data)

Simply click the ‘export’ button in the top left hand corner to export data from the top window tabs and filters.

Top Window Export

The export function in the top window section works with your current field of view in the top window. Hence, if you are using a filter and click ‘export’ it will only export the data contained within the filtered option.

Exporting Lower Window Data (URL Info, Inlinks, Outlinks, Image Info or Crawl Path Report)

To export lower window data, simply right click on the URL that you wish to export data from in the top window, then click on one of the options.

right click export of lower window data

Bulk Export

The ‘Bulk Export’ is located under the top level menu and allows bulk exporting of all data. You can export all instances of a link found in a crawl via the ‘All inlinks’ option, or export all inlinks to URLs with specific status codes such as 2XX, 3XX, 4XX or 5XX responses.

For example, selecting the ‘Client Error 4XX In Links’ option will export all inlinks to all error pages (such as 404 error pages). You can also export all image alt text, all images missing alt text and all anchor text across the site.

bulk export crawl data

You can also view our video guide about exporting from the SEO Spider –

Bulk Export Options

The following export options are available under the ‘bulk export’ top level menu.

  • All Inlinks: Links to every URI the SEO Spider encountered while crawling the site. This contains every link to all the URIs (not just ahref links, but also links to images, canonicals, hreflang, rel next/prev etc.) in the All filter of the Response Codes tab.
  • All Outlinks: All links the SEO Spider encountered during crawling. This will contain every link contained in every URI in the Response Codes tab in the All filter.
  • All Anchor Text: All ahref links to URIs in the All filter in the Response Codes tab.
  • All Images: All references to Images in the All filter of the Images tab.
  • Screenshots: An export of all the screenshots seen in the ‘Rendered Page‘ lower window tab, stored when using JavaScript rendering mode.
  • All Page Source: The static HTML source or rendered HTML of crawled pages. Rendered HTML is only available when in JavaScript rendering mode.
  • External Links: All links to URI found under the All filter of the External tab.
  • Response Codes: All links to the URIs in the corresponding filter of the Response Codes tab. e.g. All source links to URLs that respond with 404 errors on the site.
  • Directives: All links to the URIs in the corresponding filter of the Directives tab. e.g. Links to all the pages on the site that contain a meta robots ‘noindex’ tag.
  • Canonicals: All links to the URIs in the corresponding filter of the Canonicals tab. e.g. Links which have missing canonicals.
  • AMP: All links to the URIs in the corresponding filter of the AMP tab. e.g. Pages which have amphtml links with non-200 responses.
  • Images: All references to the image URIs in the corresponding filter of the Images tab. e.g. All the references to images that are missing alt text.
  • Sitemaps: All references to the URIs in the corresponding filter of the Sitemaps tab. e.g. All XML Sitemaps which contain non-indexable URLs.
  • Custom: All links to the URIs in the corresponding filter of the Custom tab. e.g. Links to all the pages on the site that matched a Custom Search.

Robots.txt

The Screaming Frog SEO Spider is robots.txt compliant. It obeys robots.txt in the same way as Google.

It will check the robots.txt of the subdomain(s) and follow (allow/disallow) directives specifically for the ‘Screaming Frog SEO Spider’ user-agent, if present; if not, for Googlebot; and then for ALL robots. Hence, by default it follows any directives for Googlebot, so if certain pages or areas of the site are disallowed for Googlebot, the SEO Spider will not crawl them either. The tool supports URL matching of file values (wildcards * / $), just like Googlebot, too.

You can choose to ignore the robots.txt (it won’t even download it) in the paid (licenced) version of the software by selecting ‘Configuration > robots.txt > Settings > Ignore robots.txt’.

You can also view URLs blocked by robots.txt under the ‘Response Codes’ tab and ‘Blocked by Robots.txt’ filter. This will also show the matched robots.txt line of the disallow against each blocked URL.

Finally, there is also a custom robots.txt configuration, which allows you to download, edit and test a site’s robots.txt under ‘Configuration > robots.txt > Custom’. Please read our user guide about using the Screaming Frog SEO Spider as a robots.txt tester.

A few things to remember here  –

  • The SEO Spider only follows one set of user-agent directives as per the robots.txt protocol. Hence, priority is the Screaming Frog SEO Spider UA if you have any. If not, the SEO Spider will follow commands for the Googlebot UA, or lastly the ‘ALL’ or global directives.
  • To reiterate the above, if you specify directives for the Screaming Frog SEO Spider or Googlebot, then the ALL (or ‘global’) bot commands will be ignored. If you want the global directives to be obeyed, then you will have to include those lines under the specific UA section for the SEO Spider or Googlebot (see the example after this list).
  • If you have conflicting directives (i.e. an allow and a disallow to the same file path), then a matching allow directive beats a matching disallow if it contains equal or more characters in the command.
  • If the Robots User Agent is left blank, the SEO Spider will only obey the rules for * if present.
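
For example, with a robots.txt like the one below (the paths are purely illustrative), the SEO Spider would obey only the rules under its own user-agent section – so /private/ would not be blocked for it unless the disallow was repeated there:

User-agent: *
Disallow: /private/

User-agent: Screaming Frog SEO Spider
Disallow: /beta/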

User agent

The SEO Spider obeys robots.txt protocol. Its user agent is ‘Screaming Frog SEO Spider’ so you can include the following in your robots.txt if you wish to block it –

User-agent: Screaming Frog SEO Spider

Disallow: /

Or alternatively if you wish to exclude certain areas of your site specifically for the SEO Spider, simply use the usual robots.txt syntax with our user-agent. Please note – there is an option to ‘ignore robots.txt’, which is entirely the responsibility of the user.
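
For example, as the SEO Spider supports the same wildcard matching as Googlebot, rules such as the following (the paths and patterns are purely illustrative) would block it from parameterised sort URLs and PDF files:

User-agent: Screaming Frog SEO Spider
Disallow: /*?sort=
Disallow: /*.pdf$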

Memory

The Screaming Frog SEO Spider uses a configurable hybrid storage engine, which can enable it to crawl millions of URLs. However, it does require memory and storage configuration, as well as the recommended hardware.

By default the SEO Spider will crawl using RAM, rather than saving to disk. This has advantages, but it can’t crawl at scale without lots of RAM allocated.

In standard memory storage mode there isn’t a set number of pages it can crawl; it is dependent on the complexity of the site and the user’s machine specifications. The SEO Spider sets a maximum memory of 1gb for 32-bit and 2gb for 64-bit machines, which enables it to crawl between 5k-100k URLs of a site.

You can increase the SEO Spider’s memory allocation, and crawl into hundreds of thousands of URLs purely using RAM. A 64-bit machine with 8gb of RAM will generally allow you to crawl a couple of hundred thousand URLs, if the memory allocation is increased.

The SEO Spider can be configured to save crawl data to disk, which enables it to crawl millions of URLs. However, we recommend using this option with a Solid State Drive (SSD), as hard disk drives are significantly slower at writing and reading data. This can be configured by selecting Database Storage mode (under ‘Configuration > System > Storage’).

As a rough guide, an SSD and 8gb of RAM in database storage mode, should allow the SEO Spider to crawl approx. 5 million URLs.

High Memory Usage

If you have received the following ‘high memory usage’ warning message when performing a crawl –

High Memory Usage Warning

Or if you are experiencing a slowdown in a crawl, or of the program itself during a large crawl, this might be due to reaching the memory allocation.

This is warning you that the SEO Spider has reached the current memory allocation and it needs to be increased to crawl more URLs. To do this, you should save the crawl via the ‘File > Save’ menu. You can then follow the instructions below to increase your memory allocation, before opening the saved crawl and resuming it again.

Increasing Memory

You’re able to set the memory allocation within the application itself by selecting ‘Configuration > System > Memory’.

Memory allocation

The SEO Spider will display the physical memory installed on the system, and allow you to configure the allocation quickly. We recommend setting the memory at 2gb below your maximum.

Please remember to restart the application for the changes to take place.

You can verify your settings have taken effect by following the guide here.

Database Storage

As discussed above, you can switch to database storage mode to increase the number of URLs that can be crawled. We recommend using a Solid State Drive (SSD) for this storage mode, and it can be quickly configured within the application (‘Configuration > System > Storage’).

database storage mode

We recommend this as the default storage for users with an SSD, and for crawling at scale. Database storage mode allows for more URLs to be crawled for a given memory setting, with close to RAM storage crawling speed for set-ups with a solid state drive (SSD).

The default crawl limit is 5 million URLs, but it isn’t a hard limit – the SEO Spider is capable of crawling significantly more (with the right set-up). As an example, a machine with a 500gb SSD and 16gb of RAM, should allow you to crawl up to 10 million URLs approximately.

While not recommended, if you have a fast hard disk drive (HDD), rather than a solid state drive (SSD), then this mode can still allow you to crawl more URLs. However, the writing and reading speed of a hard drive does become the bottleneck in crawling – so both crawl speed, and the interface itself, will be significantly slower.

If you’re working on the machine while crawling, it can also impact machine performance, so the crawl speed might need to be reduced to cope with the load. SSDs are so fast, they generally don’t have this problem and this is why ‘database storage’ can be used as the default for both small and large crawls.

Increasing memory on macOS 10.7.2 and earlier

If you’re not using SEO Spider version 2.40, then please follow these instructions.

Open ‘Finder’ and navigate to the ‘Applications’ folder, probably listed under ‘Favourites’, as below. Select ‘Screaming Frog SEO Spider’, right click and choose ‘Show Package Contents’.

show package contents osx

Then expand the ‘Contents’ folder, select ‘Info.plist’, right click and choose ‘Open With’ and then ‘Other’.

other text edit

In the resulting prompt menu, choose ‘TextEdit’.

text edit

Now find the following section and change the value appropriately (around line 30):

VMOptions
Edit the value below to change memory settings – -Xmx1024M for 1GB etc
-Xmx512M
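
For example, to allocate 4GB of RAM (assuming your machine has enough physical memory), you would change the value to:

-Xmx4096M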

Choose ‘File’ then ‘Save’, then ‘TextEdit’ and ‘Quit TextEdit’. Then re-launch the SEO Spider and your new memory settings will now be active.

You can verify your settings have taken effect by following the guide here.

Checking memory allocation

After updating your memory settings you can verify the changes have taken effect by going to Help->Debug and looking at the Memory line.

The SEO Spider allocates either 1gb (32-bit) or 2gb (64-bit) by default, so the line will look something like this:

Memory: Physical=16.0GB, Used=170MB, Free=85MB, Total=256MB, Max=2048MB, Using 8%

The Max figure will always be a little less than the amount allocated. Allocating 4GB will look like this:

Memory: Physical=16.0GB, Used=162MB, Free=93MB, Total=256MB, Max=4096MB, Using 3%

Please note, the figures shown here aren’t exact as the VM overhead varies between Operating system and Java version.

Troubleshooting
If changing memory settings has no effect on this output please check you don’t have a _JAVA_OPTS environment variable set.

Cookies

By default the Screaming Frog SEO Spider does not accept cookies, just like search engine bots.

However, under ‘Configuration > Spider’ in the ‘Advanced’ tab, there is an option to accept cookies. This is useful for crawling sites that require the client to accept cookies in order to be crawled.

Cookies are stored per crawl and shared between crawler threads. Cookies are not stored when a crawl is saved, so resuming crawls from a saved .seospider file will not maintain the cookies used previously.

Cookies are reset at the start of a new crawl. It is not possible to modify or set cookies manually.

XML sitemap creation

The Screaming Frog SEO Spider allows you to create an XML sitemap or a specific image XML sitemap, located under ‘Sitemaps’ in the top level navigation.

create an XML Sitemap

The ‘Create XML Sitemap’ feature allows you to create an XML Sitemap with all HTML 200 response pages discovered in a crawl, as well as PDFs and images. The ‘Create Images Sitemap’ option is a little different to the ‘Create XML Sitemap’ option with ‘images’ included: it contains all images with a 200 response, and ONLY pages that have images on them.

If you have over 49,999 URLs the SEO Spider will automatically create additional sitemap files and a sitemap index file referencing the sitemap locations. The SEO Spider conforms to the standards outlined in the sitemaps.org protocol.
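
As a rough illustration, a sitemap index file in the sitemaps.org format looks like the example below (the file names are purely illustrative, and won’t necessarily match those the SEO Spider generates):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap_1.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap_2.xml</loc></sitemap>
</sitemapindex>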

Read our detailed tutorial on how to use the SEO Spider as an XML Sitemap Generator, or continue below for a quick overview of each of the XML Sitemap configuration options.

Adjusting Pages To Include

By default, only HTML pages with a ‘200’ response from a crawl will be included in the sitemap, so no 3XX, 4XX or 5XX responses. Pages which are ‘noindex’, ‘canonicalised’ (the canonical URL is different to the URL of the page), paginated (URLs with a rel=“prev”) or PDFs are also not included as standard, but this can be adjusted within the XML Sitemap ‘pages’ configuration.

pages to include in the xml sitemap

If you have crawled URLs which you don’t want included in the XML Sitemap export, then simply highlight them in the user interface, right click and ‘remove’ before creating the XML Sitemap. Alternatively you can export the ‘internal’ tab to Excel, filter and delete any URLs that are not required, and re-upload the file in list mode before exporting the sitemap. Or simply block them via the exclude feature or robots.txt before a crawl.

Last Modified

It’s optional whether to include the ‘lastmod’ attribute in an XML Sitemap, so this is also optional in the SEO Spider. This configuration allows you to either use the server response, or a custom date for all URLs.

last modified

Priority

‘Priority’ is an optional attribute to include in an XML Sitemap. You can ‘untick’ the ‘include priority tag’ box, if you don’t want to set the priority of URLs.

xml sitemap priority

Change Frequency

It’s optional whether to include the ‘changefreq’ attribute, and the SEO Spider allows you to configure this based on the ‘last modification header’ or ‘level’ (depth) of the URLs. The ‘calculate from last modified header’ option means if the page has been changed in the last 24 hours, it will be set to ‘daily’; if not, it’s set as ‘monthly’.

change frequency
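
For reference, these optional attributes sit alongside each URL in the sitemaps.org format, roughly as in the example entry below (the values are purely illustrative):

<url>
  <loc>https://www.example.com/blog/</loc>
  <lastmod>2019-01-01</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.5</priority>
</url>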

Images

It’s entirely optional whether to include images in the XML sitemap. If the ‘include images’ option is ticked, then all images under the ‘Internal’ tab (and under ‘Images’) will be included by default. As shown in the screenshot below, you can also choose to include images which reside on a CDN and appear under the ‘external’ tab within the UI.

image xml sitemap

Typically images like logos or social profile icons are not included in an image sitemap, so you can also choose to only include images with a certain number of source attribute references to help exclude these. Often images like logos are linked to sitewide, while images on product pages for example might only be linked to once or twice. There is an ‘IMG Inlinks’ column in the ‘Images’ tab which shows how many times an image is referenced, to help decide the number of ‘inlinks’ which might be suitable for inclusion.

Visualisations

There are two main types of visualisations available within the top level menu of the SEO Spider, crawl visualisations and directory tree visualisations.

Crawl visualisations are useful for viewing internal linking, while directory tree visualisations are more useful for understanding URL structure and organisation.

There is a force-directed diagram and a tree graph version of each type of visualisation available via the ‘Visualisations’ top-level menu.

SEO visualisations

There are also a couple of extra visualisations available via a right-click, which include an ‘Inlink Anchor Text Word Cloud’ and ‘Body Text Word Cloud’.

right click visualisations

Crawl Visualisations

Crawl visualisations include the ‘Force-Directed Crawl Diagram’ and ‘Crawl Tree Graph’.

These crawl visualisations are useful for analysis of internal linking, as they provide a view of how the SEO Spider has crawled the site, by the first link to a page. If there are multiple shortest paths to a page (i.e. a URL is linked to from two URLs at the same crawl depth), then the first URL crawled (often the first in the source) is used.

force directed crawl diagram

If you click on the ‘i’ symbol in the top right-hand corner of the visualisation, it explains what each colour represents.

Indexable pages are represented by the green nodes, the darkest, largest circle in the middle is the start URL (the homepage), and those surrounding it are the next level deep, and they get further away, smaller and lighter with increasing crawl depth (like a heatmap).

The pastel red highlights URLs that are non-indexable, which makes it quite easy to spot problematic areas of a website. There are valid reasons for non-indexable pages, but visualising their proportion and where they are, can be useful in quickly identifying areas of interest to investigate further.

Visualisations will show up to 10k URLs at a time, as they are extremely memory intensive. However, you’re able to open them from any URL to view the site from that point.

You’re also able to right-click and ‘focus’ to expand on particular areas of a site to show more URLs in that section (up to another 10k URLs at a time). You can use the browser as navigation, typing in a URL directly, and moving forwards and backwards.

force directed crawl diagram right click focus

You can also type a URL directly into the browser, or right-click on any URL in a crawl, and open up a visualisation from that point as a visual URL explorer.

right click visualisations

When a visualisation has reached the 10k URL limit, it lets you know when a particular node has children that are being truncated (due to size limits), by colouring the nodes grey. You can then right click and ‘explore’ to see the children. This way, every URL in a crawl can be visualised.

right click focus to view truncated children

You’re also able to configure the size of nodes, overlap, separation, colour, link length and when to display text.

visualisation configuration

You’re therefore able to produce colourful visualisations, like the below.

Pretty force-directed crawl diagram

You’re also able to scale visualisations by other metrics to provide greater insight, such as unique inlinks, word count, GA Sessions, GSC Clicks, Link Score, Moz Page Authority and more.

The size and colour of nodes will scale based upon these metrics, which can help visualise many different things alongside internal linking, such as sections of a site which might have thin content.

low content using force-directed diagram

Or highest value by link score.

link score using force-directed diagram

You can also view internal linking in a more simplistic crawl tree graph, which can be configured to display left to right, or top to bottom.

crawl tree visualisation

You can right click and ‘focus’ on particular areas of the site. You can also expand or collapse up to a particular crawl depth, and adjust the level and node spacing.

crawl tree visualisation focused

Like the force-directed diagrams, all the colours can also be adjusted.

Directory Tree Visualisations

Directory Tree visualisations include the ‘Force-Directed Directory Tree Diagram’ and ‘Directory Tree Graph’.

The ‘Directory Tree’ view helps to understand a site’s URL architecture, and the way it’s organised, as opposed to internal linking of the crawl visualisations. This can be useful, as these groupings often share the same page templates and SEO issues.

The force-directed directory tree diagram is unique to the SEO Spider (and you can see it’s very different for a crawl of our site than the previous crawl diagram), and makes it easier to visualise potential problems.

force-directed directory tree diagram

Notice how the non-indexable red nodes are organised together, as they have the same template, whereas in the crawl diagram they are distributed throughout. This view often makes it easier to see patterns.

This can be viewed in a more simplistic directory tree graph format, too. These graphs are interactive, and here’s a zoomed in, top-down view of a section of our website.

directory tree graph

It’s important to remember that nodes don’t always represent a URL in this directory tree view. They can merely represent a path, which doesn’t exist as a URL. An example of this is the subfolder /author/ for the Screaming Frog website. It has URLs contained within the subfolder (/author/name/) which exist, but the /author/ path itself doesn’t.

In a directory tree view, this is still shown to enable grouping; however, only the ‘path’ is shown when you hover over it –

directory tree path

A URL will contain more information like this –

directory tree path real URL

Inlink Anchor Text & Body Text Word Clouds

These are available via right-clicking on a URL and going to ‘Visualisations’.

inlink anchor text word cloud

The ‘inlink anchor text word cloud’ includes all internal anchor text to a given URL and image alt text of hyperlinked images to a page.

The ‘body text word cloud’ includes all text within the HTML body of a page. To view this visualisation, the ‘Store HTML‘ configuration must be enabled.

Crawl Analysis

The SEO Spider usually analyses and reports data at run-time, where metrics, tabs and filters are populated during a crawl. However, ‘Link Score’ and a small number of filters require calculation at the end of a crawl (or when a crawl has been stopped).

The full list of items that require ‘crawl analysis’ can be viewed below, and seen under ‘Crawl Analysis > Configure’.

crawl analysis

All of the above are filters under their respective tabs, apart from ‘Link Score’, which is a metric and shown as a column in the ‘Internal’ tab.

In the right hand ‘overview’ window pane, filters which require post ‘crawl analysis’ are marked with ‘Crawl Analysis Required’ for further clarity. The ‘Sitemaps’ filters in particular, mostly require post-crawl analysis.

Right hand overview crawl analysis required

They are also marked as ‘You need to perform crawl analysis for this tab to populate this filter’ within the main window pane.

Crawl Analysis tabs message

This analysis can be automatically performed at the end of a crawl by ticking the respective ‘Auto Analyse At End of Crawl’ tickbox under ‘Configure’, or it can be run manually by the user.

To run the crawl analysis, simply click ‘Crawl Analysis > Start’.

Start Crawl Analysis

When the crawl analysis is running you’ll see the ‘analysis’ progress bar with a percentage complete. The SEO Spider can continue to be used as normal during this period.

Crawl Analysis Running

When the crawl analysis has finished, the empty filters which are marked with ‘Crawl Analysis Required’, will be populated with lots of lovely insightful data.

Filter populated after crawl analysis

Please note – The Analytics and Search Console orphan URLs filters will only be populated if you have connected to their respective APIs and chosen to ‘Crawl New URLs Discovered in Google Analytics/Google Search Console’ under their ‘general’ tabs. Otherwise, orphan URLs will only be viewable under ‘Reports > Orphan Pages’.

Reports

There’s a variety of reports which can be accessed via the ‘reports’ top level navigation. These include the following –

Crawl Overview Report

This report provides a summary of the crawl, including data such as the number of URLs encountered, those blocked by robots.txt, the number crawled, the content type, response codes etc. It provides a top level summary of the numbers within each of the tabs and respective filters.

The ‘Total URI Description’ column provides information on what the ‘Total URI’ column number refers to for each individual line, to (try and) avoid any confusion.

Redirect & Canonical Chains Report

This report maps out chains of redirects and canonicals, the number of hops along the way and will identify the source, as well as if there is a loop.

In Spider mode (Mode > Spider) this report will show all redirects, from a single hop upwards. It will communicate the ‘number of redirects’ in a column and the ‘type of chain’ identified, whether it’s an HTTP Redirect, JavaScript Redirect, Canonical etc. It also flags redirect loops. If the report is empty, it means you have no loops or redirect chains that can be shortened.

The redirect chain report can also be used in list mode (Mode > List). It will show a line for every URL supplied in the list. By ticking the ‘Always follow redirects‘ and ‘always follow canonicals‘ options the SEO Spider will continue to crawl redirects and canonicals in list mode and ignore crawl depth, meaning it will report back upon all hops until the final destination. Please see our guide on auditing redirects in a site migration.

Please note – If you only perform a partial crawl, or some URLs are blocked via robots.txt, you may not receive all response codes for URLs in this report.

Non-Indexable Canonicals Report

This report highlights errors and issues with canonicals. In particular, this report will show any canonicals which have no response, are blocked by robots.txt, or have a 3XX redirect, 4XX or 5XX error (anything other than a 200 ‘OK’ response).

This report also provides data on any URLs which are discovered only via a canonical and are not linked to from the site (in the ‘unlinked’ column when ‘true’).

Pagination Reports

The ‘Non-200 Pagination URLs’ and ‘Unlinked Pagination URLs’ reports highlight errors and issues with rel=”next” and rel=”prev” attributes, which are of course used to indicate paginated content.

The ‘Non-200 Pagination URLs’ report will show any rel=”next” and rel=”prev” URLs which have no response, are blocked by robots.txt, or have a 3XX redirect, 4XX or 5XX error (anything other than a 200 ‘OK’ response).

The ‘Unlinked Pagination URLs’ report provides data on any URLs which are discovered only via a rel=”next” and rel=”prev” attribute and are not linked-to from the site (in the ‘unlinked’ column when ‘true’).

Hreflang Reports

There are a number of hreflang reports which allow data to be exported in bulk, which include the following –

  • Non-200 Hreflang URLs – This report shows any hreflang attributes which are not a 200 response (no response, blocked by robots.txt, 3XX, 4XX or 5XX responses).
  • Unlinked Hreflang URLs – This report shows any hreflang URLs which are not linked to via a hyperlink on the site.
  • Missing Confirmation Links – This report shows the page missing a confirmation link, and which page is not confirming.
  • Inconsistent Language Confirmation Links – This report shows confirmation pages which use different language codes to the same page.
  • Non Canonical Confirmation Links – This report shows the confirmation links which are to non canonical URLs.
  • Noindex Confirmation Links – This report shows the confirmation links which are to noindex URLs.

Insecure Content Report

The insecure content report will show any secure (HTTPS) URLs which have insecure elements on them, such as internal HTTP links, images, JS, CSS, SWF or external images on a CDN, social profiles etc. When you’re migrating a website to secure (HTTPS) from non secure (HTTP), it can be difficult to pick up all insecure elements and this can lead to warnings in a browser –

Firefox insecure content warning

Here’s a quick example of how a report might look (with insecure images in this case) –

insecure content report

SERP Summary Report

This report allows you to quickly export URLs, page titles and meta descriptions with their respective character lengths and pixel widths.

This report can also be used for a template to re-upload back into the SEO Spider in ‘SERP’ mode.

Orphan Pages Report

The orphan pages report provides a list of URLs collected from the Google Analytics API, Google Search Console (Search Analytics API) and XML Sitemap that were not matched against URLs discovered within the crawl.

This report will be blank, unless you have connected to Google Analytics, Search Console or configured to crawl an XML Sitemap and pull in data during a crawl.

The ‘source’ column shows exactly where the URL was discovered, but not matched against a URL in the crawl. The sources include –

  • GA – The URL was discovered via the Google Analytics API.
  • GSC – The URL was discovered in Google Search Console, by the Search Analytics API.
  • Sitemap – The URL was discovered via the XML Sitemap.
  • GA & GSC & Sitemap – The URL was discovered in Google Analytics, Google Search Console & the XML Sitemap.

This report can include any URLs returned by Google Analytics for the query you select in your Google Analytics configuration. Hence, this can include logged in areas, or shopping cart URLs, so often the most useful data for SEOs is returned by querying the landing page path dimension and ‘organic traffic’ segment. This can then help identify –

  • Orphan Pages – These are pages that are not linked to internally on the website, but do exist. These might just be old pages, those missed in an old site migration or pages just found externally (via external links, or referring sites). This report allows you to browse through the list and see which are relevant and potentially upload via list mode.
  • Errors – The report can include 404 errors, which sometimes include the referring website within the URL as well (you will need the ‘all traffic’ segment for these). This can be useful for chasing up websites to correct external links, or just 301 redirecting the URL which errors, to the correct page! This report can also include URLs which might be canonicalised or blocked by robots.txt, but are actually still indexed and delivering some traffic.
  • GA or GSC URL Matching Problems – If data isn’t matching against URLs in a crawl, you can check to see what URLs are being returned via the GA or GSC API. This might highlight any issues with the particular Google Analytics view, such as filters on URLs (like ‘extended URL’ hacks etc.). For the SEO Spider to return data against URLs in the crawl, the URLs need to match up. So changing to a ‘raw’ GA view, which hasn’t been touched in any way, might help.

Crawl Path Report

This report is not under the ‘reports’ drop-down in the top level menu; it’s available upon right-clicking a URL in the top window pane and then choosing the ‘export’ option. For example –

Crawl Path Report

This report shows you the shortest path the SEO Spider crawled to discover the URL, which can be really useful for deep pages, rather than viewing the ‘inlinks’ of lots of URLs to discover the original source URL (for example, for infinite URLs caused by a calendar).

The crawl path report should be read from bottom to top. The first URL at the bottom of the ‘source’ column is the very first URL crawled (with a ‘0’ level). The ‘destination’ shows which URLs were crawled next, and these make up the following ‘source’ URLs for the next level (1) and so on, upwards.

The final ‘destination’ URL at the very top of the report will be the URL of the crawl path report.

Command Line Interface Set-Up

If you are running on a platform that won’t allow you to run the User Interface at all, then you’ll need to follow the instructions in this guide before running the SEO Spider via the Command Line.

If you can run the User Interface, please do so before running on the Command Line. This will allow you to accept the End User Licence Agreement (EULA), enter your licence key and select a storage mode.

When the User Interface is not available to perform the initial run, you have to edit a few configuration files. The location of these varies depending on platform:

Windows:

C:\Users\USERNAME\.ScreamingFrogSEOSpider\

macOS:

~/.ScreamingFrogSEOSpider/

Ubuntu:

~/.ScreamingFrogSEOSpider/

From now on we’ll refer to this as the .ScreamingFrogSEOSpider directory.

Entering your licence key

Create a file in your .ScreamingFrogSEOSpider directory called licence.txt. Enter (copy and paste to avoid typos) your licence username on the first line and your licence key on the second line, and save the file.
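
For example, the contents of licence.txt would look something like the following, with the placeholders replaced by your own licence details:

your-licence-username
your-licence-key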

Accepting the EULA

Create or edit the file spider.config in your .ScreamingFrogSEOSpider directory. Locate and edit or add the following line:

eula.accepted=8

Save the file and exit.

Choosing Storage Mode

The default storage mode is memory. If you are happy to use memory storage you don’t need to change anything. To change to database storage mode edit the file spider.config in your .ScreamingFrogSEOSpider directory. Add or edit the storage.mode property to be:

storage.mode=DB

The default path is a directory called db in your .ScreamingFrogSEOSpider directory. If you would like to change this, add or edit the storage.db_dir property. Depending on your OS, the path will have to be entered differently.

Windows:

storage.db_dir=C\:\\Users\\USERNAME\\dbfolder

macOS:

storage.db_dir=/Users/USERNAME/dbfolder

Ubuntu:

storage.db_dir=/home/USERNAME/dbdir
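
Putting this together, a spider.config for database storage on Ubuntu might look something like the following (the EULA version number and paths are the examples used above, so check them against your own set-up):

eula.accepted=8
storage.mode=DB
storage.db_dir=/home/USERNAME/dbdir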

Command Line Interface

You’re able to operate the SEO Spider entirely via command line.

This includes launching, full configuration, saving and exporting of almost any data and reporting. The SEO Spider can also be run headless using the CLI.

This guide provides a quick overview of how to use the command line for the three OS supported, and the arguments available.

Windows
macOS
Linux
Command Line Options
Troubleshooting

Windows

Open a command prompt (Start button, then type ‘cmd’ or search programs and files for ‘Windows Command Prompt’). Move into the SEO Spider directory (64-bit) by entering:

cd "C:\Program Files\Screaming Frog SEO Spider"

Or for 32-bit:

cd "C:\Program Files (x86)\Screaming Frog SEO Spider"

On Windows, there is a separate build of the SEO Spider called ScreamingFrogSEOSpiderCli.exe (rather than the usual ScreamingFrogSEOSpider.exe). This can be run from the Windows command line and behaves like a typical console application.

You can type ScreamingFrogSEOSpiderCli.exe --help to view all arguments and see all logging come out of the command line.

CLI

To open a saved crawl:

ScreamingFrogSEOSpider.exe C:\Temp\crawl.seospider

To auto start a crawl:

ScreamingFrogSEOSpiderCli.exe --crawl https://www.example.com

Windows CLI

Then additional arguments can merely be appended with a space.

For example, the following will mean the SEO Spider runs headless, saves the crawl, outputs to your desktop and exports the internal and response codes tabs, and client error filter.

ScreamingFrogSEOSpiderCli.exe --crawl https://www.example.com --headless --save-crawl --output-folder "C:\Users\Your Name\Desktop" --export-tabs "Internal:All,Response Codes:Client Error (4xx)"
Windows CLI with more arguments

Please see the full list of command line options available to supply as arguments for the SEO Spider.

macOS

Open a terminal, found in the Utilities folder in the Applications folder, or directly using spotlight and typing: ‘Terminal’.

There are two ways to start the SEO Spider from the command line. You can use either the open command or the ScreamingFrogSEOSpiderLauncher script. The open command returns immediately allowing you to close the Terminal after. The ScreamingFrogSEOSpiderLauncher logs to the Terminal until the SEO Spider exits, closing the Terminal kills the SEO Spider.

To start the UI using the open command:

open "/Applications/Screaming Frog SEO Spider.app"

To start the UI using the ScreamingFrogSEOSpiderLauncher script:

/Applications/Screaming\ Frog\ SEO\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher

To see a full list of the command line options available:

/Applications/Screaming\ Frog\ SEO\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher --help

The following examples show both ways of launching the SEO Spider.

To open a saved crawl file:

open "/Applications/Screaming Frog SEO Spider.app" --args /tmp/crawl.seospider
/Applications/Screaming\ Frog\ SEO\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher /tmp/crawl.seospider

To start the UI and immediately start crawling:

open "/Applications/Screaming Frog SEO Spider.app" --args --crawl https://www.example.com/
/Applications/Screaming\ Frog\ SEO\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher --crawl https://www.example.com/

To start headless, immediately start crawling and save the crawl along with the Internal->All and Response Codes->Client Error (4xx) filters:

open "/Applications/Screaming Frog SEO Spider.app" --args --crawl https://www.example.com --headless --save-crawl --output-folder /tmp/cli --export-tabs "Internal:All,Response Codes:Client Error (4xx)"
/Applications/Screaming\ Frog\ SEO\ Spider.app/Contents/MacOS/ScreamingFrogSEOSpiderLauncher --crawl https://www.example.com --headless --save-crawl --output-folder /tmp/cli --export-tabs "Internal:All,Response Codes:Client Error (4xx)"

Please see the full list of command line options available to supply as arguments for the SEO Spider.

Linux

The screamingfrogseospider binary is placed in your path during installation. To run it, open a terminal and follow the examples below.

To start normally:

screamingfrogseospider

To open a saved crawl file:

screamingfrogseospider /tmp/crawl.seospider

To see a full list of the command line options available:

screamingfrogseospider --help

To start the UI and immediately start crawling:

screamingfrogseospider --crawl https://www.example.com/

To start headless, immediately start crawling and save the crawl along with the Internal->All and Response Codes->Client Error (4xx) filters:

screamingfrogseospider --crawl https://www.example.com --headless --save-crawl --output-folder /tmp/cli --export-tabs "Internal:All,Response Codes:Client Error (4xx)"

Please see the full list of command line options below.

Command Line Options

The following command line options are available to supply as arguments for the SEO Spider.

--crawl https://www.example.com

Start crawling the supplied URL.

--crawl-list [list file]

Start crawling the specified URLs in list mode.

--config [config]

Supply a saved configuration file for the SEO Spider to use.

--headless

Run in silent mode without a user interface.

--save-crawl

Save the completed crawl.

--output-folder [output]

Store saved files. Default: current working directory.

--overwrite

Overwrite files in output directory.

--timestamped-output

Create a timestamped folder in the output directory, and store all output there.

--export-tabs [tab:filter,...]

Supply a comma separated list of tabs to export. You need to specify the tab name and the filter name separated by a colon. Tab names are as they appear on the user interface, except for those configurable via Configuration->Spider->Preferences where X is used. Eg: Meta Description:Over X Characters

--bulk-export [[submenu:]export,...]

Supply a comma separated list of bulk exports to perform. The export names are the same as in the Bulk Export menu in the UI. To access exports in a submenu, use ‘submenu-name:export-name’.

--save-report [[submenu:]report,...]

Supply a comma separated list of reports to save. The report names are the same as in the Reports menu in the UI. To access reports in a submenu, use ‘submenu-name:report-name’.

--create-sitemap

Creates a sitemap from the completed crawl.

--create-images-sitemap

Creates an images sitemap from the completed crawl.

--export-format [csv|xls|xlsx]

Supply a format to be used for all exports.

--use-google-analytics [google account] [account] [property] [view] [segment]

Use the Google Analytics API during crawl.

--use-google-search-console [google account] [website]

Use the Google Search Console API during crawl.

--use-majestic

Use the Majestic API during crawl.

--use-mozscape

Use the Mozscape API during crawl.

--use-ahrefs

Use the Ahrefs API during crawl.

-h, --help

View this list of options.

Troubleshooting

  • If a headless crawl fails to export any results, make sure that the folder specified by --output-folder exists and is either empty, or that you are using the --timestamped-output option.
