SEO Spider Configuration

Spider Basic Tab

Check Images

Untick this box if you do not want to crawl images.

Please note, we check the link but don’t crawl the content. Unticking this option prevents the SEO Spider from checking images linked to using the image tag (img src=”image.jpg”).

Images linked to via any other means will still be checked, for example, using an anchor tag (a href=”image.jpg”).
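
To illustrate (with hypothetical paths), the first reference below is an image embedded via the image tag, which this setting controls, while the second is an image linked via an anchor tag, which is always checked as a normal link:

<img src="/images/logo.png" alt="Company logo" />
<a href="/images/brochure.jpg">Download the brochure</a>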

Check CSS

Untick this box if you do not want to check CSS links. 

Check JavaScript

Untick this box if you do not want to check JavaScript links. 

Check SWF

Untick this box if you do not want to check flash links. 

Follow internal or external ‘nofollow’

By default the Spider will not crawl internal or external links with the ‘nofollow’ attribute or external links from pages with the meta nofollow tag. If you would like the Spider to crawl these, simply tick the relevant option. 

Crawl all subdomains

By default the SEO Spider will only crawl the subdomain you crawl from and treat all other subdomains encountered as external sites. To crawl all subdomains of a root domain, use this option.

Crawl outside of start folder

By default the SEO Spider will only crawl forwards from the subfolder (or sub directory) you start the crawl from. However, if you wish to start a crawl from a specific sub folder, but crawl the entire website, use this option.

Crawl canonicals

By default the SEO Spider will crawl canonicals (canonical link elements or http header) and use the links contained within for discovery. If you do not wish to crawl canonicals, then please untick this box.

Please note, that canonicals will still be reported and referenced in the SEO Spider, but they will not be crawled for discovery.
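
For reference, a canonical can be specified either as a link element in the HTML head or as an HTTP response header, for example (hypothetical example.com URLs):

<link rel="canonical" href="https://www.example.com/sample-page/" />
Link: <https://www.example.com/sample-page/>; rel="canonical"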

Crawl next/prev

By default the SEO Spider will not crawl rel=”next” and rel=”prev” elements or use the links contained within them for discovery. If you wish to crawl the pages referenced in rel=”next” and rel=”prev” elements, please tick this box.
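
For example, page two of a hypothetical paginated series might reference its neighbours in the HTML head like this:

<link rel="prev" href="https://www.example.com/category?page=1" />
<link rel="next" href="https://www.example.com/category?page=3" />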

Extract hreflang

By default the SEO Spider will extract hreflang attributes and display hreflang language and region codes and the URL under the hreflang tab.

However, the URLs found in the hreflang attributes will not be crawled, unless ‘Crawl hreflang’ is ticked.

With this setting enabled, hreflang URLs will be extracted from a sitemap uploaded in list mode.
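
For illustration, the language and region codes and URLs shown under the hreflang tab are extracted from annotations such as the following (hypothetical example.com URLs):

<link rel="alternate" hreflang="en-gb" href="https://www.example.com/uk/" />
<link rel="alternate" hreflang="en-us" href="https://www.example.com/us/" />
<link rel="alternate" hreflang="x-default" href="https://www.example.com/" />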

Crawl hreflang

Enable this configuration for URLs discovered in the hreflang attributes to be crawled.

Spider Limits Tab

Limit crawl total

The free version of the software has a 500 URL crawl limit. If you have a licensed version of the tool, this is increased to 5 million URLs, but you can enter any number here for greater control over the number of pages you wish to crawl.

Limit crawl depth

You can choose how deep the SEO Spider crawls a site (in terms of links away from your chosen start point).

Limit max URL length to crawl

Control the length of URLs that the SEO Spider will crawl.

There’s a default max URL length of 2,000, due to the limits of the database storage.

Limit max folder depth

Control the number of folders (or sub directories) the SEO Spider will crawl.

Limit number of query strings

Control the number of query string parameters (?x=) the SEO Spider will crawl.

Spider Rendering Tab

Rendering

Allows you to set the rendering mode for the crawl:

  • Text Only: Looks only at the raw HTML, ignores the AJAX Crawling Scheme and JavaScript.
  • Old AJAX Crawling Scheme: Implements Google’s now deprecated AJAX Crawling Scheme.
  • JavaScript: Executes JavaScript, inspecting the DOM for links/titles etc.

Please note: To emulate Googlebot as closely as possible our rendering engine uses the Chromium project. The following operating systems are supported:

  • Windows 10
  • Windows 8 & 8.1
  • Windows 7
  • Windows Server 2008 R2
  • Windows Server 2012
  • Windows Server 2016
  • Ubuntu 14.04+ (64-bit only)
  • Mac OS X 10.9+

Note: If you are running a supported OS and are still unable to use rendering, it could be you are running in compatibility mode. To check this, go to your installation directory (C:\Program Files (x86)\Screaming Frog SEO Spider\), right click on ScreamingFrogSEOSpider.exe, select Properties, then the Compatibility tab, and check you don’t have anything ticked under the Compatibility mode section.

Rendered Page Screen Shots

This configuration is ticked by default when selecting JavaScript rendering and means screen shots are captured of rendered pages, which can be viewed in the ‘Rendered Page‘ tab which dynamically appears in the lower window pane.

You can select various window sizes from Googlebot desktop, Googlebot mobile and various other devices.

The rendered screenshots are viewable within the ‘C:\Users\User Name\.ScreamingFrogSEOSpider\screenshots-XXXXXXXXXXXXXXX’ folder, and can be exported via the ‘bulk export > Screenshots’ top level menu, to save navigating, copying and pasting.

AJAX Timeout

This is how long, in seconds, the SEO Spider should allow JavaScript to execute before considering a page loaded. Our research has shown Googlebot waits about 5 seconds.

Window Size

This sets the view port size in JavaScript rendering mode, which can be seen in the rendered page screen shots captured in the ‘Rendered Page‘ tab which dynamically appears in the lower window pane.

Spider Advanced Tab

Allow cookies

By default the SEO Spider does not accept cookies, like a search bot. However, you can choose to accept cookies by ticking this box.

Pause on high memory usage

The SEO Spider will automatically pause when a crawl has reached the memory allocation and display a ‘high memory usage’ message. However, you can choose to turn this safeguard off completely.

Always follow redirects

This feature allows the SEO Spider to follow redirects until the final redirect target URL in list mode, ignoring crawl depth. This is particularly useful for site migrations, where URLs may perform a number of 3XX redirects, before they reach their final destination. To view the chain of redirects, we recommend using the ‘redirect chains‘ report.

Respect noindex

This option means URLs with ‘noindex’ will not be reported in the SEO Spider.
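
A ‘noindex’ can be delivered either in a meta robots tag or in an X-Robots-Tag HTTP response header, for example:

<meta name="robots" content="noindex" />
X-Robots-Tag: noindex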

Respect canonical

This option means URLs which have been canonicalised to another URL will not be reported in the SEO Spider.

Respect Next/Prev

This option means URLs with a rel=”prev” in the sequence will not be reported in the SEO Spider. Only the first URL in the paginated sequence with a rel=”next” attribute will be reported.

Extract Images From img srcset Attribute

If enabled, the SEO Spider will extract images from the srcset attribute of the <img> tag. In the example below, this would be image-1x.png and image-2x.png, as well as image-src.png.

<img src="image-src.png" srcset="image-1x.png 1x, image-2x.png 2x" alt="Retina friendly images" />

Response timeout

By default the SEO Spider will wait 20 seconds to get any kind of HTTP response from a URL. You can increase the length of waiting time, which is useful for very slow websites.

5XX Response Retries

This option provides the ability to automatically re-try 5XX responses. Often these responses can be temporary, so re-trying a URL may provide a 2XX response.

Max redirects to follow

This option provides the ability to control the number of redirects the SEO Spider will follow.

Store HTML

This allows you to save the static HTML of every URL crawled by the SEO Spider to disk, and view it in the ‘View Source’ lower window pane (on the left hand side, under ‘Original HTML’). This enables you to view the original HTML before JavaScript comes into play, in the same way as a right click ‘view source’ in a browser. This is great for debugging, or for comparing against the rendered HTML.

Store rendered HTML

This allows you to save the rendered HTML of every URL crawled by the SEO Spider to disk, and view in the ‘View Source’ lower window pane (on the right hand side, under ‘Rendered HTML’). This enables you to view the DOM like ‘inspect element’ (in Chrome in DevTools), after JavaScript has been processed.

Please note, this option will only work when JavaScript rendering is enabled.

Spider Preferences Tab

Page title & meta description width

This option provides the ability to control the character and pixel width limits in the SEO Spider filters in the page title and meta description tabs. For example, changing the minimum pixel width default of ‘200’ would change the ‘Below 200 Pixels’ filter in the ‘Page Titles’ tab. This allows you to set your own character and pixel widths based upon your own preferences.

Please note – This does not update the SERP Snippet preview at this time, only the filters within the tabs.

Other character preferences

These options provide the ability to control the character length of URLs, h1, h2 and image alt text filters in their respective tabs. You can also control the max image size.

Other Configuration Options

robots.txt Settings

Ignore robots.txt

By default the SEO Spider will obey the robots.txt protocol, so it will not be able to crawl a site if it’s disallowed via robots.txt. However, this option allows you to ignore the protocol, which is at the user’s own responsibility. This option means the SEO Spider will not even download the robots.txt file, so ALL robots directives will be completely ignored.

Show Internal URLs Blocked By Robots.txt

By default internal URLs blocked by robots.txt will be shown in the ‘Internal’ tab with Status Code of ‘0’ and Status ‘Blocked by Robots.txt’. To hide these URLs in the interface unselect this option. This option is not available if ‘Ignore robots.txt’ is checked.

You can also view internal URLs blocked by robots.txt under the ‘Response Codes’ tab and ‘Blocked by Robots.txt’ filter. This will also show the robots.txt directive (‘matched robots.txt line’ column) of the disallow against each URL that is blocked.
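
As a hypothetical example, if the robots.txt contained the directive below, the blocked URL would appear with that line shown against it in the ‘matched robots.txt line’ column:

Disallow: /private/
https://www.example.com/private/report.html    (matched robots.txt line: Disallow: /private/)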

Show External URLs Blocked By Robots.txt

By default external URLs blocked by robots.txt are hidden. To display these in the External tab with Status Code ‘0’ and Status ‘Blocked by Robots.txt’ check this option. This option is not available if ‘Ignore robots.txt’ is checked.

You can also view external URLs blocked by robots.txt under the ‘Response Codes’ tab and ‘Blocked by Robots.txt’ filter. This will also show the robots.txt directive (‘matched robots.txt line’ column) of the disallow against each URL that is blocked.

Custom robots.txt

You can download, edit and test a site’s robots.txt using the custom robots.txt feature under ‘Configuration > robots.txt > Custom’, which will override the live version on the site for the crawl. It will not update the live robots.txt on the site. This feature allows you to add multiple robots.txt files at subdomain level, test directives in the SEO Spider and view URLs which are blocked or allowed.

The custom robots.txt uses the selected user-agent in the configuration.
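
For example, a custom robots.txt used to test whether proposed directives would block faceted URLs while still allowing the blog might look like this (hypothetical directives):

User-agent: *
Disallow: /*?filter=
Allow: /blog/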

During a crawl you can filter blocked URLs based upon the custom robots.txt (‘Response Codes > Blocked by robots.txt’) and see the matched robots.txt directive line.

Please read our featured user guide on using the SEO Spider as a robots.txt tester.

Please note – As mentioned above, the changes you make to the robots.txt within the SEO Spider, do not impact your live robots.txt uploaded to your server. You can however copy and paste these into the live version manually to update your live directives.

URL rewriting

The URL rewriting feature allows you to rewrite URLs on the fly. For the majority of cases, the ‘remove parameters’ and common options (under ‘options’) will suffice. However, we do also offer an advanced regex replace feature which provides further control.

URL rewriting is only applied to URLs discovered in the course of crawling a website, not URLs that are entered as the start of a crawl in ‘Spider’ mode, or as part of a set of URLs in ‘List’ mode.

Remove Parameters

This feature allows you to automatically remove parameters in URLs. This is extremely useful for websites with session IDs, Google Analytics tracking or lots of parameters which you wish to remove. For example –

If the website has session IDs which make the URLs appear something like ‘example.com/?sid=random-string-of-characters’, to remove the session ID you just need to add ‘sid’ (without the apostrophes) within the ‘parameters’ field in the ‘remove parameters’ tab.

The SEO Spider will then automatically strip the session ID from the URL. You can test to see how a URL will be rewritten by our SEO Spider under the ‘test’ tab.
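
With ‘sid’ in the remove parameters list, a hypothetical URL would be rewritten like so, leaving any other parameters in place:

Before: https://www.example.com/category?sid=AB12CD34&page=2
After:  https://www.example.com/category?page=2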

This feature can also be used for removing Google Analytics tracking parameters. For example, you can just include the following under ‘remove parameters’ –

utm_source
utm_medium
utm_campaign

This will strip the standard tracking parameters from URLs.

Or potentially in JavaScript rendering mode, you may wish to remove anything after the hash value. You can simply include the following within the remove parameters box –

#.*

Options

We will include common options under this section. The ‘lowercase discovered URLs’ option does exactly that: it converts all URLs crawled into lowercase, which can be useful for websites with case sensitivity issues in URLs.

Regex Replace

This advanced feature runs against each URL found during a crawl or in list mode. It replaces each substring of a URL that matches the regex with the given replace string. The “Regex Replace” feature can be tested in the “Test” tab of the “URL Rewriting” configuration window.

Examples are:

1) Changing all links from http to https

Regex: http
Replace: https

2) Changing all links to example.com to be example.co.uk

Regex: .com
Replace: .co.uk

3) Rewriting all links containing page=number to a fixed number, eg

www.example.com/page.php?page=1
www.example.com/page.php?page=2
www.example.com/page.php?page=3
www.example.com/page.php?page=4

To make all these go to www.example.com/page.php?page=1

Regex: page=\d+
Replace: page=1

4) Removing the www. domain from any URL by using an empty Replace. If you want to remove a query string parameter, please use the “Remove Parameters” feature – Regex is not the correct tool for this job!

Regex: www.
Replace:

5) Stripping all parameters

Regex: \?.*
Replace:

6) Changing links for only subdomains of example.com from http to https

Regex: http://(.*example.com)
Replace: https://$1

Include

This feature allows you to control which URL path the SEO Spider will crawl via regex. It narrows the default search by only crawling the URLs that match the regex, which is particularly useful for larger sites, or sites with less intuitive URL structures. Matching is performed on the URL encoded version of the URL.

The page that you start the crawl from must have an outbound link which matches the regex for this feature to work. (Obviously if there is not a URL which matches the regex from the start page, the SEO Spider will not crawl anything!).

  • As an example, if you wanted to crawl pages from https://www.screamingfrog.co.uk which have ‘search’ in the URL string you would simply include the regex:
    .*search.*
    in the ‘include’ feature. This would find the /search-engine-marketing/ and /search-engine-optimisation/ pages as they both have ‘search’ in them.

Troubleshooting

  • Matching is done on the URL Encoded Address, which you can see in the URL Info tab in the lower window pane.
  • The regular expression must match the whole URL, not just part of it.

Exclude

The exclude configuration allows you to exclude URLs from a crawl by supplying a list of regular expressions (regex). A URL that matches an exclude is not crawled at all (it’s not just ‘hidden’ in the interface). Hence, this will mean other URLs that do not match the exclude, but can only be reached from an excluded page will also not be found in the crawl.

The exclude list is applied to new URLs that are discovered during the crawl. This exclude list does not get applied to the initial URL(s) supplied in crawl or list mode. Changing the exclude list during a crawl will affect newly discovered URLs and will now be applied retrospectively to the list of pending URLs. Matching is performed on the URL encoded version of the URL. You can see the encoded version of a URL by selecting it in the main window, then looking at the URL Info tab in the details pane of the lower window, where it is the value in the second row labelled “URL Encoded Address”.

Here are some common examples –

  • To exclude a specific URL or page the syntax is:
    http://www.example.com/do-not-crawl-this-page.html
  • To exclude a sub directory or folder the syntax is:
    http://www.example.com/do-not-crawl-this-folder/.*
  • To exclude everything after brand where there can sometimes be other folders before:
    http://www.example.com/.*/brand.*
  • If you wish to exclude URIs with a certain parameter such as ‘?price’ contained in a variety of different directories you can simply use (Note the ? is a special character in regex and must be escaped):
    .*\?price.*
  • If you wanted to exclude all files ending jpg, the regex would be:
    .*jpg$
  • If you wanted to exclude any URI with ‘produce’ within them, the regex would be:
    .*produce.*
  • Secure (https) pages would be:
    .*https.*
  • Excluding all pages on http://www.domain.com would be:
    http://www.domain.com/.*
  • If you want to exclude a URL and it doesn’t seem to be working, it’s probably because it contains special regex characters such as ?. Rather than trying to locate and escape these individually, you can escape the whole line by starting it with \Q and ending it with \E as follows:
    \Qhttp://www.example.com/test.php?product=special\E
  • Remember to use the encoded version of the URL. So if you wanted to exclude any URLs with a pipe |, it would be:
    .*%7C.*

You can also view our video guide about the exclude feature in the SEO Spider.

Speed

The speed configuration allows you to control the speed of the SEO Spider, either by number of concurrent threads, or by URLs requested per second.

When reducing speed, it’s always easier to control by the ‘Max URI/s’ option, which is the maximum number of URL requests per second. For example, setting this to ‘1.0’ would mean crawling at a maximum of 1 URL per second.

The ‘Max Threads’ option can simply be left alone when you throttle speed via URLs per second.

Increasing the number of threads allows you to significantly increase the speed of the SEO Spider. By default the SEO Spider crawls at 5 threads, to not overload servers.

Please use the threads configuration responsibly, as setting the number of threads high to increase the speed of the crawl will increase the number of HTTP requests made to the server and can impact a site’s response times. In very extreme cases, you could overload a server and crash it.

We recommend approving a crawl rate with the webmaster first, monitoring response times and adjusting speed if there are any issues.

User agent

You can find the ‘User-Agent’ configuration under ‘Configuration > HTTP Header > User-Agent’.

The user-agent switcher has inbuilt preset user agents for Googlebot, Bingbot, Yahoo! Slurp, various browsers and more. This feature also has a custom user-agent setting which allows you to specify your own user agent.

Details on how the SEO Spider handles robots.txt can be found here.

HTTP Header

The HTTP Header configuration allows you to supply completely custom header requests during a crawl.

Custom HTTP Headers

This means you’re able to set anything from accept-language, cookie and referer, to supplying any unique header name. For example, there are scenarios where you may wish to supply an Accept-Language HTTP header in the SEO Spider’s request to crawl locale-adaptive content.

You can choose to supply any language and region pair that you require within the header value field.
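
For instance, to request German content from a locale-adaptive site you could supply a header along these lines (the value shown is illustrative):

Accept-Language: de-DE,de;q=0.9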

Custom extraction

The custom extraction feature allows you to collect any data from the HTML of a URL. Extraction is performed on the static HTML returned by internal HTML pages with a 2xx response code. The SEO Spider supports the following modes to perform data extraction:

  • XPath: XPath selectors, including attributes.
  • CSS Path: CSS Path and optional attribute.
  • Regex: For more advanced uses, such as scraping HTML comments or inline JavaScript.

When using XPath or CSS Path to collect HTML, you can choose what to extract:

  • Extract HTML Element: The selected element and its inner HTML content.
  • Extract Inner HTML: The inner HTML content of the selected element. If the selected element contains other HTML elements, they will be included.
  • Extract Text: The text content of the selected element and the text content of any sub elements.
  • Function Value: The result of the supplied function, eg count(//h1) to find the number of h1 tags on a page.
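
As an illustration, assuming a page marks up author names with a hypothetical ‘author-name’ class, the same data could be collected with any of the three modes:

XPath:    //span[@class="author-name"]
CSS Path: span.author-name
Regex:    <span class="author-name">(.*?)</span>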

You’ll receive a tick next to your regex, XPath or CSS Path if the syntax is valid. If you’ve made a mistake, a red cross will remain in place!

The results of the data extraction appear under the ‘custom’ tab in the ‘extraction’ filter. They are also included as columns within the ‘Internal’ tab as well.

For more details on custom extraction please see our Web Scraping Guide.

For examples of custom extraction expressions, please see our XPath Examples and Regex Examples.

Regex Troubleshooting

  • The SEO Spider does not pre-process HTML before running regexes. Please bear in mind, however, that the HTML you see in a browser when viewing source may be different to what the SEO Spider sees. This can be caused by the website returning different content based on User-Agent or Cookies, or if the page’s content is generated using JavaScript and you are not using JavaScript rendering.
  • More details on the regex engine used by the SEO Spider can be found here.
  • The regex engine is configured such that the dot character matches newlines.

Google Analytics integration

You can connect to the Google Analytics API and pull in data directly during a crawl. The SEO Spider can fetch user and session metrics, as well as goal conversions and ecommerce (transactions and revenue) data for landing pages, so you can view your top performing pages when performing a technical or content audit.

If you’re running an AdWords campaign, you can also pull in impressions, clicks, cost and conversion data and the SEO Spider will match your destination URLs against the site crawl, too. You can also collect other metrics of interest, such as AdSense data (Ad impressions, clicks, revenue etc), site speed or social activity and interactions.

To set this up, start the SEO Spider and go to ‘Configuration > API Access > Google Analytics’.

Then you just need to connect to a Google account (which has access to the Analytics account you wish to query) by granting the ‘Screaming Frog SEO Spider’ app permission to access your account to retrieve the data. Google APIs use the OAuth 2.0 protocol for authentication and authorisation. The SEO Spider will remember any Google accounts you authorise within the list, so you can ‘connect’ quickly upon starting the application each time.

Once you have connected, you can choose the relevant Google Analytics account, property, view, segment and date range!

Then simply select the metrics that you wish to fetch! The SEO Spider currently allows you to select up to 30, which we might extend further. If you keep the number of metrics to 10 or below with a single dimension (as a rough guide), then it will generally be a single API query per 10k URLs, which makes it super quick.

By default the SEO Spider collects the following 10 metrics –

  1. Sessions
  2. % New Sessions
  3. New Users
  4. Bounce Rate
  5. Page Views Per Session
  6. Avg Session Duration
  7. Page Value
  8. Goal Conversion Rate
  9. Goal Completions All
  10. Goal Value All

You can read more about the definition of each metric from Google.

You can also set the dimension of each individual metric against either page path and/or landing page, which are quite different (and both useful depending on your scenario and objectives).

There are scenarios where URLs in Google Analytics might not match URLs in a crawl, so we cover these by matching trailing and non-trailing slash URLs and case sensitivity (upper and lowercase characters in URLs). Google doesn’t pass the protocol (HTTP or HTTPS) via their API, so we also match this data automatically.

If you have hundreds of thousands of URLs in GA, you can choose to limit the number of URLs to query, which is by default ordered by sessions to return the top performing page data.

When you hit ‘start’ to crawl, the Google Analytics data will then be fetched and displayed in the respective columns within the ‘Internal’ and ‘Analytics’ tabs. There’s a separate ‘Analytics’ progress bar in the top right and when this has reached 100%, crawl data will start appearing against URLs. The more URLs you query, the longer this process can take, but generally it’s extremely quick.

There are 3 filters currently under the ‘Analytics’ tab, which allow you to filter the Google Analytics data –

  • Sessions Above 0 – This simply means the URL in question has 1 or more sessions.
  • Bounce Rate Above 70% – This means the URL has a bounce rate over 70%, which you may wish to investigate. In some scenarios this is normal though!
  • No GA Data – This means that for the metrics and dimensions queried, the Google API didn’t return any data for the URLs in the crawl. So the URLs either didn’t receive any sessions, or perhaps the URLs in the crawl are just different to those in GA for some reason.

As an example for our own website, we can see there is ‘no GA data’ for blog category pages and a few old blog posts, as you might expect (the query was landing page, rather than page). Remember, you may see pages appear here which are ‘noindex’ or ‘canonicalised’, unless you have ‘respect noindex‘ and ‘respect canonicals‘ ticked in the advanced configuration tab.

If GA data does not get pulled into the SEO Spider as you expected, then analyse the URLs in GA under ‘Behaviour > Site Content > All Pages’ and ‘Behaviour > Site Content > Landing Pages’, depending on which dimension you choose in your query. The URLs here need to match those in the crawl exactly, otherwise the SEO Spider won’t be able to match up the data accurately.

We recommend checking your default Google Analytics view settings (such as ‘default page’) and filters which all impact how URLs are displayed and hence matched against a crawl. If you want URLs to match up, you can often make the required amends within Google Analytics.

Google Search Console Integration

You can connect to the Google Search Analytics API and pull in data directly during a crawl. The SEO Spider can fetch impressions, clicks, CTR and position metrics from Google Search Analytics, so you can view your top performing pages when performing a technical or content audit.

To set this up, start the SEO Spider and go to ‘Configuration > API Access > Google Search Console’. Connecting to Google Search Console works in the same way as already detailed in our step by step Google Analytics integration guide.

You just need to connect to a Google account (which has access to the Search Console account you wish to query) by granting the ‘Screaming Frog SEO Spider’ app permission to access your account to retrieve the data. Google APIs use the OAuth 2.0 protocol for authentication and authorisation. The SEO Spider will remember any Google accounts you authorise within the list, so you can ‘connect’ quickly upon starting the application each time.

Once you have connected, you can choose the relevant website property, date range and dimensions!

By default the SEO Spider collects the following metrics –

  • Clicks
  • Impressions
  • CTR
  • Position

You can read more about the definition of each metric from Google.

There are three dimension filters: device type (desktop, tablet and mobile), a country filter, and a search query filter for ‘contain’ or ‘doesn’t contain’ words, to exclude brand queries as an example.

There are 2 filters currently under the ‘Search Console’ tab, which allow you to filter the Google Search Console data –

  • Clicks Above 0 – This simply means the URL in question has 1 or more clicks.
  • No GSC Data – This means that the API didn’t return any data for the URLs in the crawl. So the URLs either didn’t receive any impressions, or perhaps the URLs in the crawl are just different to those in GSC for some reason.

Majestic

In order to use Majestic, you will need a subscription which allows you to pull data from their API. You then just need to navigate to ‘Configuration > API Access > Majestic’ and then click on the ‘generate an Open Apps access token’ link.

You will then be taken to Majestic, where you need to ‘grant’ access to the Screaming Frog SEO Spider.

You will then be given a unique access token from Majestic.

Copy and input this token into the API key box in the Majestic window, and click ‘connect’.

You can then select the data source (fresh or historic) and metrics, at either URL, subdomain or domain level.

Then simply click ‘start’ to perform your crawl, and the data will be automatically pulled via their API, and can be viewed under the ‘link metrics’ and ‘internal’ tabs.

Ahrefs

In order to use Ahrefs, you will need a subscription which allows you to pull data from their API. You then just need to navigate to ‘Configuration > API Access > Ahrefs’ and then click on the ‘generate an API access token’ link.

You will then be taken to Ahrefs, where you need to ‘allow’ access to the Screaming Frog SEO Spider.

You will then be given a unique access token from Ahrefs (but hosted on the Screaming Frog domain).

Copy and input this token into the API key box in the Ahrefs window, and click ‘connect’.

You can then select the metrics you wish to pull at either URL, subdomain or domain level.

Then simply click ‘start’ to perform your crawl, and the data will be automatically pulled via their API, and can be viewed under the ‘link metrics’ and ‘internal’ tabs.

Moz

You will require a Moz account to pull data from the Mozscape API. Moz offer a free limited API and a separate paid API, which allows users to pull more metrics, at a faster rate. Please note, this is a separate subscription to a standard Moz PRO account. You can read about free vs paid access over at Moz.

To access the API, with either a free account, or paid subscription, you just need to login to your Moz account and view your API ID and secret key.

Copy and input both the access ID and secret key into the respective API key boxes in the Moz window under ‘Configuration > API Access > Moz’, select your account type (‘free’ or ‘paid’), and then click ‘connect’.

You can then select the metrics available to you, based upon your free or paid plan. Simply choose the metrics you wish to pull at either URL, subdomain or domain level.

Then simply click ‘start’ to perform your crawl, and the data will be automatically pulled via their API, and can be viewed under the ‘link metrics’ and ‘internal’ tabs.

Authentication

The SEO Spider supports two forms of authentication, standards based which includes basic and digest authentication, and web forms based authentication.

Basic & Digest Authentication

There is no set-up required for basic and digest authentication, it is detected automatically during a crawl of a page which requires a login. If you visit the website and your browser gives you a pop-up requesting a username and password, that will be basic or digest authentication. If the login screen is contained in the page itself, this will be a web form authentication, which is discussed in the next section.

Often sites in development will also be blocked via robots.txt, so make sure this is not the case or use the ‘ignore robots.txt’ configuration. Then simply insert the staging site URL and crawl; a pop-up box will appear, just like it does in a web browser, asking for a username and password.

Enter your credentials and the crawl will continue as normal. You cannot pre-enter login credentials – they are entered when URLs that require authentication are crawled. This feature does not require a licence key. Try the following pages to see how authentication works in your browser, or in the SEO Spider.

Web Form Authentication

There are other web forms and areas which require you to log in with cookies for authentication to be able to view or crawl them. The SEO Spider allows users to log in to these web forms within the SEO Spider’s built-in Chromium browser, and then crawl them. This feature requires a licence to use.

To log in, simply navigate to ‘Configuration -> Authentication’ then switch to the ‘Forms Based’ tab, click the ‘Add’ button, enter the URL for the site you want to crawl, and a browser will pop up allowing you to log in.

Please read our guide on crawling web form password protected sites in our user guide before using this feature. Some websites may also require JavaScript rendering to be enabled when logged in to be able to crawl them.

Please note – This is a very powerful feature, and should therefore be used responsibly. The SEO Spider clicks every link on a page; when you’re logged in that may include links to log you out, create posts, install plugins, or even delete data.

Troubleshooting

  • Forms based authentication uses the configured User-Agent. If you are unable to log in, perhaps try changing this to Chrome or another browser.

Memory

The SEO Spider uses Java which requires memory to be allocated at start-up. By default the SEO Spider will allow 1gb for 32-bit, and 2gb for 64-bit machines.

Increasing memory allocation will enable the SEO Spider to crawl more URLs, particularly when in RAM storage mode, but also when storing to database.

We recommend setting the memory allocation to 2gb below your total physical machine memory.

If you’d like to find out more about crawling large websites, memory allocation and the storage options available, please see our guide on crawling large websites.

Storage

The Screaming Frog SEO Spider uses a configurable hybrid engine, allowing users to choose to store crawl data in RAM, or in a database.

By default the SEO Spider uses RAM, rather than your hard disk to store and process data. This provides amazing benefits such as speed and flexibility, but it does also have disadvantages, most notably, crawling at scale.

However, if you have an SSD the SEO Spider can also be configured to save crawl data to disk, by selecting ‘Database Storage’ mode (under ‘Configuration > System > Storage’), which enables it to crawl at truly unprecedented scale, while retaining the same, familiar real-time reporting and usability.

Fundamentally both storage modes can still provide virtually the same crawling experience, allowing for real-time reporting, filtering and adjusting of the crawl. However, there are some key differences, and the ideal storage will depend on the crawl scenario and machine specifications.

Memory Storage

Memory storage mode allows for super fast and flexible crawling for virtually all set-ups. However, as machines have less RAM than hard disk space, it means the SEO Spider is generally better suited for crawling websites under 500k URLs in memory storage mode.

Users are able to crawl more than this with the right set-up, and depending on how memory intensive the website is that’s being crawled. As a very rough guide, a 64-bit machine with 8gb of RAM will generally allow you to crawl a couple of hundred thousand URLs.

As well as being a better option for smaller websites, memory storage mode is also recommended for machines without an SSD, or where there isn’t much disk space.

Database Storage

We recommend this as the default storage for users with an SSD, and for crawling at scale. Database storage mode allows for more URLs to be crawled for a given memory setting, with close to RAM storage crawling speed for set-ups with a solid state drive (SSD).

The default crawl limit is 5 million URLs, but it isn’t a hard limit – the SEO Spider is capable of crawling significantly more (with the right set-up). As an example, a machine with a 500gb SSD and 16gb of RAM, should allow you to crawl up to 10 million URLs approximately.

While not recommended, if you have a fast hard disk drive (HDD), rather than a solid state drive (SSD), then this mode can still allow you to crawl more URLs. However, the writing and reading speed of a hard drive does become the bottleneck in crawling – so both crawl speed, and the interface itself, will be significantly slower.

If you’re working on the machine while crawling, it can also impact machine performance, so the crawl speed might need to be reduced to cope with the load. SSDs are so fast, they generally don’t have this problem and this is why ‘database storage’ can be used as the default for both small and large crawls.

Troubleshooting

  • ExFAT/MS-DOS (FAT) file systems are not supported on macOS due to JDK-8205404.

Proxy

This feature (Configuration > Proxy) allows you to configure the SEO Spider to use a proxy server.

You will need to configure the address and port of the proxy in the configuration window. To disable the proxy server untick the ‘Use Proxy Server’ option.

Please note:

  • Only 1 proxy server can be configured.
  • You must restart for your changes to take effect.
  • No exceptions can be added – either all HTTP/HTTPS traffic goes via the proxy or none of it does.
  • Some proxies may require you to input login details before the crawl using forms based authentication.

Mode

Spider Mode

This is the default mode of the SEO Spider. In this mode the SEO Spider will crawl a web site, gathering links and classifying URLs into the various tabs and filters. Simply enter the URL of your choice and click ‘start’.

List Mode

In this mode you can check a predefined list of URLs. This list can come from a variety of sources – a simple copy and paste, or a .txt, .xls, .xlsx, .csv or .xml file. The files will be scanned for http:// or https:// prefixed URLs; all other text will be ignored. For example, you can directly upload an AdWords download and all URLs will be found automatically.

If you’re performing a site migration and wish to test URLs, we highly recommend using the ‘always follow redirects‘ configuration so the SEO Spider finds the final destination URL. The best way to view these is via the ‘redirect chains’ report, and we go into more detail within our ‘How To Audit Redirects‘ guide.

List mode changes the crawl depth setting to zero, which means only the uploaded URLs will be checked. If you want to check links from these URLs, adjust the crawl depth to 1 or more in the ‘Limits’ tab in ‘Configuration > Spider’. List mode also sets the spider to ignore robots.txt by default, we assume if a list is being uploaded the intention is to crawl all the URLs in the list.

If you wish to export data in list mode in the same order it was uploaded, then use the ‘Export’ button which appears next to the ‘upload’ and ‘start’ buttons at the top of the user interface.

The data in the export will be in the same order and include all of the exact URLs in the original upload, including duplicates or any fix-ups performed.

SERP Mode

In this mode you can upload page titles and meta descriptions directly into the SEO Spider to calculate pixel widths (and character lengths!). There is no crawling involved in this mode, so they do not need to be live on a website.

This means you can export page titles and descriptions from the SEO Spider, make bulk edits in Excel (if that’s your preference, rather than in the tool itself) and then upload them back into the tool to understand how they may appear in Google’s SERPs.

Under ‘reports’, we have a new ‘SERP Summary’ report which is in the format required to re-upload page titles and descriptions. We simply require three headers for ‘URL’, ‘Title’ and ‘Description’.

For example –

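Assuming a hypothetical URL, title and description, a minimal CSV upload in the required format would look like this:

URL,Title,Description
https://www.example.com/,Example Brand - Home,A short meta description for the home page.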

You can upload in a .txt, .csv or Excel file.
