SEO Spider Configuration

User Guide

Spider Crawl Tab

Images

Configuration > Spider > Crawl > Images

You can choose to store and crawl images independently.

Unticking the ‘store’ configuration will mean image files within an img element will not be stored and will not appear within the SEO Spider.

<img src="image.jpg">

Unticking the ‘crawl’ configuration will mean image files within an img element will not be crawled to check their response code.

Images linked to via any other means will still be stored and crawled, for example, using an anchor tag.
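
For example, an image linked via an anchor tag such as the following (illustrative URL) would still be stored and crawled:

<a href="https://www.example.com/images/photo.jpg">View full size image</a>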

The exclude or custom robots.txt can be used for images linked in anchor tags.

Please read our guide on How To Find Missing Image Alt Text & Attributes.


CSS

Configuration > Spider > Crawl > CSS

This allows you to store and crawl CSS files independently.

Unticking the ‘store’ configuration will mean CSS files will not be stored and will not appear within the SEO Spider.

Unticking the ‘crawl’ configuration will mean stylesheets will not be crawled to check their response code.


JavaScript

Configuration > Spider > Crawl > JavaScript

You can choose to store and crawl JavaScript files independently.

Unticking the ‘store’ configuration will mean JavaScript files will not be stored and will not appear within the SEO Spider.

Unticking the ‘crawl’ configuration will mean JavaScript files will not be crawled to check their response code.


SWF

Configuration > Spider > Crawl > SWF

You can choose to store and crawl SWF (Adobe Flash File format) files independently.

Unticking the ‘store’ configuration will mean SWF files will not be stored and will not appear within the SEO Spider.

Unticking the ‘crawl’ configuration will mean SWF files will not be crawled to check their response code.


Canonicals

Configuration > Spider > Crawl > Canonicals

By default the SEO Spider will store and crawl canonicals (in canonical link elements or HTTP header) and use the links contained within for discovery.
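
For example, a canonical can be specified in either a link element in the HTML head, or in the HTTP response header (illustrative URL):

<link rel="canonical" href="https://www.example.com/sample-page/" />
Link: <https://www.example.com/sample-page/>; rel="canonical"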

Unticking the ‘store’ configuration will mean canonicals will not be stored and will not appear within the SEO Spider.

Unticking the ‘crawl’ configuration will mean URLs discovered in canonicals will not be crawled. If only ‘store’ is selected, they will continue to be reported in the interface, but they won’t be used for discovery.

Please read our guide on How To Audit Canonicals.


Pagination (rel next/prev)

Configuration > Spider > Crawl > Pagination (Rel Next/Prev)

By default the SEO Spider will not crawl rel=”next” and rel=”prev” attributes or use the links contained within them for discovery.
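
These attributes are typically declared in link elements within the HTML head, for example (illustrative URLs):

<link rel="prev" href="https://www.example.com/category?page=1">
<link rel="next" href="https://www.example.com/category?page=3">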

Unticking the ‘store’ configuration will mean rel=”next” and rel=”prev” attributes will not be stored and will not appear within the SEO Spider.

Unticking the ‘crawl’ configuration will mean URLs discovered in rel=”next” and rel=”prev” will not be crawled.

Please read our guide on How To Audit rel=”next” and rel=”prev” Pagination Attributes.


Hreflang

Configuration > Spider > Crawl > Hreflang

By default the SEO Spider will extract hreflang attributes and display hreflang language and region codes and the URL in the hreflang tab.
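
For example, a hreflang annotation in the HTML head looks like the following (illustrative URLs):

<link rel="alternate" hreflang="en-gb" href="https://www.example.com/en-gb/" />
<link rel="alternate" hreflang="de" href="https://www.example.com/de/" />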

However, the URLs found in the hreflang attributes will not be crawled and used for discovery, unless ‘Crawl hreflang’ is ticked. With this setting enabled, hreflang URLs will be extracted from an XML sitemap uploaded in list mode.

Unticking the ‘store’ configuration will mean hreflang attributes will not be stored and will not appear within the SEO Spider.

Unticking the ‘crawl’ configuration will mean URLs discovered in hreflang will not be crawled.

Please read our guide on How To Audit Hreflang.


Meta refresh

Configuration > Spider > Crawl > Meta Refresh

By default the SEO Spider will store and crawl URLs contained within a meta refresh.

<meta http-equiv="refresh" content="4; URL='www.screamingfrog.co.uk/meta-refresh-url'"/>

Unticking the ‘store’ configuration will mean meta refresh details will not be stored and will not appear within the SEO Spider.

Unticking the ‘crawl’ configuration will mean URLs discovered within a meta refresh will not be crawled.


iframes

Configuration > Spider > Crawl > iframes

By default the SEO Spider will store and crawl URLs contained within iframes.

<iframe src="https://www.screamingfrog.co.uk/iframe/"/>

Unticking the ‘store’ configuration will mean iframe details will not be stored and will not appear within the SEO Spider.

Unticking the ‘crawl’ configuration will mean URLs discovered within an iframe will not be crawled.


Crawl outside of start folder

Configuration > Spider > Crawl > Crawl Outside of Start Folder

By default the SEO Spider will only crawl the subfolder (or subdirectory) you crawl from, and onwards. However, if you wish to start a crawl from a specific subfolder, but crawl the entire website, use this option.


Crawl all subdomains

Configuration > Spider > Crawl > Crawl All Subdomains

By default the SEO Spider will only crawl the subdomain you crawl from and treat all other subdomains encountered as external sites. These will only be crawled to a single level and shown under the External tab.

For example, if https://www.screamingfrog.co.uk is entered as the start URL, then other subdomains discovered in the crawl such as https://cdn.screamingfrog.co.uk or https://images.screamingfrog.co.uk will be treated as ‘external’, as well as other domains such as www.google.co.uk etc.

To crawl all subdomains of a root domain (such as https://cdn.screamingfrog.co.uk or https://images.screamingfrog.co.uk), then this configuration should be enabled.

The CDNs configuration option can be used to treat external URLs as internal.

Please note – If a crawl is started from the root, and a subdomain is not specified at the outset (for example, starting the crawl from https://screamingfrog.co.uk), then all subdomains will be crawled by default. This is similar to the behaviour of a site: query in Google search.


Follow internal or external ‘nofollow’

Configuration > Spider > Crawl > Follow Internal/External “Nofollow”

By default the SEO Spider will not crawl internal or external links with the ‘nofollow’, ‘sponsored’ and ‘ugc’ attributes, or links from pages with the meta nofollow tag and nofollow in the X-Robots-Tag HTTP Header.
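
For example, links and directives such as the following (illustrative URL) are not crawled by default:

<a href="https://www.example.com/" rel="nofollow">Example link</a>
<meta name="robots" content="nofollow">
X-Robots-Tag: nofollow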

If you would like the SEO Spider to crawl these, simply enable this configuration option.


Crawl linked XML sitemaps

Configuration > Spider > Crawl > Crawl Linked XML Sitemaps

The SEO Spider will not crawl XML Sitemaps by default (in regular ‘Spider’ mode). To crawl XML Sitemaps and populate the filters in the Sitemaps tab, this configuration should be enabled.

When the ‘Crawl Linked XML Sitemaps’ configuration is enabled, you can choose to either ‘Auto Discover XML Sitemaps via robots.txt’, or supply a list of XML Sitemaps by ticking ‘Crawl These Sitemaps’, and pasting them into the field that appears.

Please note – Once the crawl has finished, a ‘Crawl Analysis‘ will need to be performed to populate the Sitemap filters. Please read our guide on ‘How To Audit XML Sitemaps‘.

Spider Extraction Tab

Page details

Configuration > Spider > Extraction > Page Details

The following on-page elements are configurable to be stored in the SEO Spider.

  • Page Titles
  • Meta Descriptions
  • Meta Keywords
  • H1
  • H2
  • Indexability (& Indexability Status)
  • Word Count
  • Readability
  • Text to Code Ratio
  • Hash Value
  • Page Size
  • Forms

Disabling any of the above options from being extracted will mean they will not appear within the SEO Spider interface in respective tabs, columns or filters.

Some filters and reports will obviously not work anymore if they are disabled. For example, if the ‘hash value’ is disabled, then the ‘URL > Duplicate’ filter will no longer be populated, as this uses the hash value as an algorithmic check for exact duplicate URLs.

A small amount of memory will be saved from not storing the data of each element.


URL details

Configuration > Spider > Extraction > URL Details

The following URL Details are configurable to be stored in the SEO Spider.

  • Response Time – Time in seconds to download the URL. More detailed information can be found in our FAQ.
  • Last-Modified – Read from the Last-Modified header in the server’s HTTP response. If the server does not provide this, the value will be empty.
  • HTTP Headers – This will store full HTTP request and response headers which can be seen in the lower ‘HTTP Headers’ tab. The full response headers are also included in the Internal tab to allow them to be queried alongside crawl data. They can be bulk exported via ‘Bulk Export > Web > All HTTP Headers’ and an aggregated report can be exported via ‘Reports > HTTP Header > HTTP Headers Summary’.
  • Cookies – This will store cookies found during a crawl in the lower ‘Cookies’ tab. JavaScript rendering mode will need to be used to get an accurate view of cookies which are loaded on the page using JavaScript or pixel image tags. Cookies can be bulk exported via ‘Bulk Export > Web > All Cookies’ and an aggregated report can be exported via ‘Reports > Cookies > Cookie Summary’. Please note, when you choose to store cookies, the auto exclusion performed by the SEO Spider for Google Analytics tracking tags is disabled to provide an accurate view of all cookies issued. This means it will affect your analytics reporting, unless you choose to exclude any tracking scripts from firing by using the exclude configuration (‘Config > Exclude’) or filter out the ‘Screaming Frog SEO Spider’ user-agent, similar to excluding PSI.

Disabling any of the above options from being extracted will mean they will not appear within the SEO Spider interface in respective tabs and columns.

A small amount of memory will be saved from not storing the data of each element.


Directives

Configuration > Spider > Extraction > Directives

The following directives are configurable to be stored in the SEO Spider.

  • Meta Robots
  • X-Robots-Tag

Disabling any of the above options from being extracted will mean they will not appear within the SEO Spider interface in respective tabs, columns or filters.

A small amount of memory will be saved from not storing the data.


Structured data

Configuration > Spider > Extraction > Structured Data

Structured Data is entirely configurable to be stored in the SEO Spider. Please see our detailed guide on How To Test & Validate Structured Data, or continue reading below to understand more about the configuration options.

By default the SEO Spider will not extract and report on structured data. The following configuration options will need to be enabled for different structured data formats to appear within the ‘Structured Data’ tab.

  • JSON-LD – This configuration option enables the SEO Spider to extract JSON-LD structured data, and for it to appear under the ‘Structured Data’ tab.
  • Microdata – This configuration option enables the SEO Spider to extract Microdata structured data, and for it to appear under the ‘Structured Data’ tab.
  • RDFa – This configuration option enables the SEO Spider to extract RDFa structured data, and for it to appear under the ‘Structured Data’ tab.

You can also select to validate structured data, against Schema.org and Google rich result features.

Schema.org Validation

This configuration option is only available if one or more of the structured data formats are enabled for extraction.

If enabled, then the SEO Spider will validate structured data against Schema.org specifications. It checks whether the types and properties exist and will show ‘errors’ for any issues encountered.

For example, it checks to see whether http://schema.org/author exists for a property, or http://schema.org/Book exists as a type. It validates against main and pending Schema vocabulary from their latest versions. The Structured Data tab and filter will show details of validation errors.
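
As an illustration, the types and properties in a JSON-LD block like the one below (sample values only) would each be checked against the Schema.org vocabulary:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Book",
  "name": "Example Book",
  "author": {
    "@type": "Person",
    "name": "Jane Doe"
  }
}
</script>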

Additionally, this validation checks for out-of-date use of Data-Vocabulary.org schema.

Google Rich Result Feature Validation

This configuration option is only available if one or more of the structured data formats are enabled for extraction.

If enabled, then the SEO Spider will validate structured data against Google rich result feature requirements according to their own documentation. Validation issues for required properties will be classed as errors, while issues around recommended properties will be classed as warnings, in the same way as Google’s own Structured Data Testing Tool.

The Structured Data tab and filter will show details of Google feature validation errors and warnings.

The full list of Google rich result features that the SEO Spider is able to validate against can be seen in our guide on How To Test & Validate Structured Data.


HTML

Configuration > Spider > Extraction > Store HTML / Rendered HTML

Store HTML

This allows you to save the static HTML of every URL crawled by the SEO Spider to disk, and view it in the ‘View Source’ lower window pane (on the left hand side, under ‘Original HTML’). They can be bulk exported via ‘Bulk Export > Web > All Page Source’.

This enables you to view the original HTML before JavaScript comes into play, in the same way as a right click ‘view source’ in a browser. This is great for debugging, or for comparing against the rendered HTML.

Store rendered HTML

This allows you to save the rendered HTML of every URL crawled by the SEO Spider to disk, and view it in the ‘View Source’ lower window pane (on the right hand side, under ‘Rendered HTML’). They can be bulk exported via ‘Bulk Export > Web > All Page Source’.

This enables you to view the DOM as you would with ‘inspect element’ (in Chrome DevTools, for example), after JavaScript has been processed.

Please note, this option will only work when JavaScript rendering is enabled.


PDF

Configuration > Spider > Extraction > PDF

Store PDF

This allows you to save PDFs to disk during a crawl. They can be bulk exported via ‘Bulk Export > Web > All PDF Documents’, or just the content can be exported as .txt files via ‘Bulk Export > Web > All PDF Content’.

When PDFs are stored, the PDF can be viewed in the ‘Rendered Page’ tab and the text content of the PDF can be viewed in the ‘View Source’ tab and ‘Visible Content’ filter.

Extract PDF Properties

By default the PDF title and keywords will be extracted. These will appear in the ‘Title’ and ‘Meta Keywords’ columns in the Internal tab of the SEO Spider.

Google will convert the PDF to HTML and use the PDF title as the title element and the keywords as meta keywords, although it doesn’t use meta keywords in scoring.

By enabling ‘Extract PDF properties’, the following additional properties will also be extracted.

  1. Subject
  2. Author
  3. Creation Date
  4. Modification Date
  5. Page Count
  6. Word Count

These new columns are displayed in the Internal tab.

Spider Limits Tab

Limit crawl total

Configuration > Spider > Limits > Limit Crawl Total

The free version of the software has a 500 URL crawl limit. If you have a licensed version of the tool this will be replaced with 5 million URLs, but you can include any number here for greater control over the number of pages you wish to crawl.


Limit crawl depth

Configuration > Spider > Limits > Limit Crawl Depth

You can choose how deep the SEO Spider crawls a site (in terms of links away from your chosen start point).


Limit URLs per crawl depth

Configuration > Spider > Limits > Limit URLs Per Crawl Depth

Control the number of URLs that are crawled at each crawl depth.


Limit max folder depth

Configuration > Spider > Limits > Limit Max Folder Depth

Control the number of folders (or subdirectories) the SEO Spider will crawl.

The SEO Spider classifies folders as the parts of the URL path after the domain that end in a trailing slash:

  • https://www.screamingfrog.co.uk/ – folder depth 0
  • https://www.screamingfrog.co.uk/seo-spider/ – folder depth 1
  • https://www.screamingfrog.co.uk/seo-spider/#download – folder depth 1
  • https://www.screamingfrog.co.uk/seo-spider/fake-page.html – folder depth 1
  • https://www.screamingfrog.co.uk/seo-spider/user-guide/ – folder depth 2

Limit number of query strings

Configuration > Spider > Limits > Limit Number of Query Strings

Excludes from the crawl any URLs containing more than the configured number of query string parameters. For example, if set to ‘2’, example.com/?query1&query2&query3 won’t be crawled.


Limit max URL length to crawl

Configuration > Spider > Limits > Limit Max URL Length

Control the length of URLs that the SEO Spider will crawl.

There’s a default max URL length of 2,000, due to the limits of the database storage.


Max redirects to follow

Configuration > Spider > Limits > Limit Max Redirects to Follow

This option provides the ability to control the number of redirects the SEO Spider will follow.


Limit by URL path

Configuration > Spider > Limits > Limit by URL Path

Control the number of URLs that are crawled by URL path. Enter a list of URL patterns and the maximum number of pages to crawl for each.

Spider Rendering Tab

Rendering

Configuration > Spider > Rendering

This configuration allows you to set the rendering mode for the crawl:

  • Text Only: The SEO Spider will crawl and extract from the raw HTML only. It ignores the AJAX Crawling Scheme, and client-side JavaScript.
  • Old AJAX Crawling Scheme: The SEO Spider will obey Google’s long deprecated AJAX Crawling Scheme if discovered. If not present, it will crawl the raw HTML like default ‘Text Only’ mode.
  • JavaScript: The SEO Spider will execute client-side JavaScript by rendering the page in its headless Chrome browser, crawling and extracting from the rendered HTML for content and links. Like Google, it will also discover any links in the raw HTML.

Please note: To emulate Googlebot as closely as possible our rendering engine uses the Chromium project. The following operating systems are supported:

  • Windows 10
  • Windows 11
  • Windows Server 2016
  • Windows Server 2022
  • Ubuntu 14.04+ (64-bit only)
  • macOS 11+

Please note: If you are running a supported OS and are still unable to use rendering, it could be you are running in compatibility mode.

To check this, go to your installation directory (C:\Program Files (x86)\Screaming Frog SEO Spider\), right click on ScreamingFrogSEOSpider.exe, select ‘Properties’, then the ‘Compatibility’ tab, and check you don’t have anything ticked under the ‘Compatibility Mode’ section.


Rendered page screen shots

Configuration > Spider > Rendering > JavaScript > Rendered Page Screenshots

This configuration is enabled by default when selecting JavaScript rendering and means screenshots are captured of rendered pages, which can be viewed in the ‘Rendered Page‘ tab, in the lower window pane.

You can select various window sizes from Googlebot desktop, Googlebot Smartphone and various other devices.

The rendered screenshots are viewable within the ‘C:\Users\User Name\.ScreamingFrogSEOSpider\screenshots-XXXXXXXXXXXXXXX’ folder, and can be exported via the ‘Bulk Export > Web > Screenshots’ top level menu, to save navigating, copying and pasting.


JavaScript error reporting

Configuration > Spider > Rendering > JavaScript > JavaScript Error Reporting

This setting enables JavaScript error reporting to be captured and reported under respective filters in the ‘JavaScript’ tab.

Detailed JavaScript errors, warnings and issues can be viewed in the lower ‘Chrome Console Log’ tab and bulk exported via ‘Bulk Export > JavaScript > Pages With JavaScript Issues’.


Flatten Shadow DOM

Configuration > Spider > Rendering > JavaScript > Flatten Shadow DOM

Google is able to flatten and index Shadow DOM content as part of the rendered HTML of a page. This configuration is enabled by default, but can be disabled.


Flatten iframes

Configuration > Spider > Rendering > JavaScript > Flatten iframes

Google will inline iframes into a div in the rendered HTML of a parent page, if conditions allow. These include the height being set, having a mobile viewport, and not being noindex. We try to mimic Google’s behaviour. This configuration is enabled by default, but can be disabled.


AJAX timeout

Configuration > Spider > Rendering > JavaScript > AJAX Timeout

This is how long, in seconds, the SEO Spider should allow JavaScript to execute before considering a page loaded. This timer starts after the Chromium browser has loaded the web page and any referenced resources, such as JS, CSS and Images.

In reality, Google is more flexible than a fixed 5 second mark. It adapts based upon how long a page takes to load content, and network activity and things like caching also play a part. However, Google obviously won’t wait forever, so content that you want to be crawled and indexed needs to be available quickly, or it simply won’t be seen.

The 5 second mark is a reasonable rule of thumb for users, and Googlebot.


Window size

Configuration > Spider > Rendering > JavaScript > Window Size

This sets the viewport size in JavaScript rendering mode, which can be seen in the rendered page screen shots captured in the ‘Rendered Page‘ tab.

For both ‘Googlebot Mobile: Smartphone’ and ‘Googlebot Desktop’ window sizes, the SEO Spider emulates Googlebot behaviour and re-sizes the page so it’s very long, to capture as much data as possible. Google stretches the page in a similar way, to load and capture any additional content.

The SEO Spider will load the page with 411×731 pixels for mobile or 1024×768 pixels for desktop, and then re-size the length up to 8,192px. This is the limit we are currently able to capture in the in-built Chromium browser. Google are able to re-size up to a height of 12,140 pixels.

In rare cases the window size can influence the rendered HTML. For example, some websites may not include certain elements on smaller viewports, which can impact results such as the word count and links.

For other device window sizes, the viewport chosen will be used for rendering any content, links and screenshots – without resizing to a longer viewport.

Spider Advanced Tab

Ignore non-indexable URLs for Issues

Configuration > Spider > Advanced > Ignore Non-Indexable URLs for Issues

When enabled, the SEO Spider will only populate issue-related filters if the page is Indexable. This includes all filters under Page Titles, Meta Description, Meta Keywords, H1 and H2 tabs and the following other issues –

  • ‘Low Content Pages’ in the Content tab.
  • ‘Missing’, ‘Validation Errors’ and ‘Validation Warnings’ in the Structured Data tab.
  • ‘Orphan URLs’ in the Sitemaps tab.
  • ‘No GA Data’ in the Analytics tab.
  • ‘No Search Analytics Data’ in the Search Console tab.
  • ‘Pages With High Crawl Depth’ in the Links tab.

For example, this means URLs won’t be flagged as ‘Duplicate’, ‘Over X Characters’ or ‘Below X Characters’ if they are set as ‘noindex’, and are hence non-indexable.

We recommend disabling this feature if you’re crawling a staging website which has a sitewide noindex.


Ignore paginated URLs for duplicate filters

Configuration > Spider > Advanced > Ignore Paginated URLs for Duplicate Filters

When enabled, URLs with rel=”prev” in the sequence will not be considered for ‘Duplicate’ filters under Page Titles, Meta Description, Meta Keywords, H1 and H2 tabs. Only the first URL in the paginated sequence, with a rel=”next” attribute will be considered.

This means paginated URLs won’t be considered as having a ‘Duplicate’ page title with the first page in the series, for example. This is normal and expected behaviour, so with this configuration enabled it will not be flagged as an issue.


Always follow redirects

Configuration > Spider > Advanced > Always Follow Redirects

This feature allows the SEO Spider to follow redirects until the final redirect target URL in list mode, ignoring crawl depth. This is particularly useful for site migrations, where URLs may perform a number of 3XX redirects, before they reach their final destination.

To view redirects in a site migration, we recommend using the ‘all redirects‘ report.

Please see our guide on ‘How To Use List Mode‘ for more information on how this configuration can be utilised.


Always follow canonicals

Configuration > Spider > Advanced > Always Follow Canonicals

This feature allows the SEO Spider to follow canonicals until the final redirect target URL in list mode, ignoring crawl depth. This is particularly useful for site migrations, where canonicals might be canonicalised multiple times, before they reach their final destination.

To view the chain of canonicals, we recommend enabling this configuration and using the ‘canonical chains‘ report.

Please see our guide on ‘How To Use List Mode‘ for more information on how this configuration can be utilised like ‘always follow redirects’.


Respect noindex

Configuration > Spider > Advanced > Respect Noindex

This option means URLs with ‘noindex’ will not be reported in the SEO Spider. These URLs will still be crawled and their outlinks followed, but they won’t appear within the tool.


Respect canonical

Configuration > Spider > Advanced > Respect Canonical

This option means URLs which have been canonicalised to another URL, will not be reported in the SEO Spider. These URLs will still be crawled and their outlinks followed, but they won’t appear within the tool.


Respect next/prev

Configuration > Spider > Advanced > Respect Next/Prev

This option means URLs with a rel=”prev” in the sequence, will not be reported in the SEO Spider. Only the first URL in the paginated sequence with a rel=”next” attribute will be reported.

These URLs will still be crawled and their outlinks followed, but they won’t appear within the tool.


Respect HSTS policy

Configuration > Spider > Advanced > Respect HSTS Policy

HTTP Strict Transport Security (HSTS) is a standard, defined in RFC 6797, by which a web server can declare to a client that it should only be accessed via HTTPS.
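
A server declares an HSTS policy using a response header, for example:

Strict-Transport-Security: max-age=31536000; includeSubDomains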

The client (in this case, the SEO Spider) will then make all future requests over HTTPS, even if following a link to an HTTP URL. When this happens the SEO Spider will show a Status Code of 307, a Status of “HSTS Policy” and Redirect Type of “HSTS Policy”.

You can disable this feature and see the ‘true’ status code behind a redirect (such as a 301 permanent redirect for example). Please see more details in our ‘An SEOs guide to Crawling HSTS & 307 Redirects‘ article.


Respect self referencing meta refresh

Configuration > Spider > Advanced > Respect Self Referencing Meta Refresh

You can disable the ‘Respect Self Referencing Meta Refresh’ configuration to stop self referencing meta refresh URLs being considered as ‘non-indexable’.

It’s fairly common for sites to have a self referencing meta refresh for various reasons, and generally this doesn’t impact indexing of the page. However, it should be investigated further, as it’s redirecting to itself, and this is why it’s flagged as ‘non-indexable’.


Extract images from img srcset attribute

Configuration > Spider > Advanced > Extract Images From IMG SRCSET Attribute

If enabled, the SEO Spider will extract images from the srcset attribute of the <img> tag. In the example below this would be image-1x.png and image-2x.png, as well as image-src.png.

<img src="image-src.png" srcset="image-1x.png 1x, image-2x.png 2x" alt="Retina friendly images" />


Crawl fragment identifiers

Configuration > Spider > Advanced > Crawl Fragment Identifiers

If enabled the SEO Spider will crawl URLs with hash fragments and consider them as separate unique URLs.

https://www.screamingfrog.co.uk/#this-is-treated-as-a-separate-url/

By default, the SEO Spider will ignore anything after the hash value, like a search engine. However, crawling fragments can be useful when analysing in-page jump links and bookmarks, for example.


Response timeout

Configuration > Spider > Advanced > Response Timeout (secs)

The SEO Spider will wait 20 seconds to get any kind of HTTP response from a URL by default. You can increase the length of waiting time for very slow websites.


5XX response retries

Configuration > Spider > Advanced > 5XX Response Retries

This option provides the ability to automatically re-try 5XX responses. Often these responses can be temporary, so re-trying a URL may provide a 2XX response.

Spider Preferences Tab

Page title & meta description width

Configuration > Spider > Preferences > Page Title/Meta Description Width

This option provides the ability to control the character and pixel width limits in the SEO Spider filters in the page title and meta description tabs.

For example, changing the minimum pixel width default number of ‘200’ for page title width, would change the ‘Below 200 Pixels’ filter in the ‘Page Titles’ tab. This allows you to set your own character and pixel width based upon your own preferences.

Please note – This does not update the SERP Snippet preview at this time, only the filters within the tabs.


Other character preferences

Configuration > Spider > Preferences > Other

These options provide the ability to control the character length of URLs, h1, h2, image alt text, max image size and low content pages filters in their respective tabs.

For example, if the ‘Max Image Size Kilobytes’ was adjusted from 100 to ‘200’, then only images over 200kb would appear in the ‘Images > Over X kb’ tab and filter.

Other Configuration Options

Content area

Configuration > Content > Area

You can specify the content area used for word count, near duplicate content analysis and spelling and grammar checks. This can help focus analysis on the main content area of a page, avoiding known boilerplate text.

By default the SEO Spider will only consider text contained within the body HTML element of a web page. Both the nav and footer HTML elements are also excluded by default, to help focus the content area used on the main content of the page.

However, not all websites are built using these HTML5 semantic elements, and sometimes it’s useful to refine the content area used in the analysis further. You’re able to add a list of HTML elements, classes or IDs to exclude or include for the content used.

For example, the Screaming Frog website has a mobile menu outside the nav element, which is included within the content analysis by default. The mobile menu can be seen in the content preview of the ‘duplicate details’ tab shown below when checking for duplicate content (as well as the ‘Spelling & Grammar Details’ tab).

Near Duplicate Content Pre Content Settings Refinement

By right clicking and viewing source of the HTML of our website, we can see this menu has a ‘mobile-menu__dropdown’ class. The ‘mobile-menu__dropdown’ can then be excluded in the ‘Exclude Classes’ box –

Near Duplicate Content Area

The mobile menu is then removed from near duplicate analysis and the content shown in the duplicate details tab (as well as Spelling & Grammar and word counts).

Near Duplicate Content Settings Refined

Content area settings can be adjusted post-crawl for near duplicate content analysis and spelling and grammar. Near duplicates will require ‘crawl analysis‘ to be re-run to update the results, and spelling and grammar requires its analysis to be refreshed via the right hand ‘Spelling & Grammar’ tab or lower window ‘Spelling & Grammar Details’ tab.

Please see our tutorials on finding duplicate content and spelling and grammar checking.


Duplicates

Configuration > Content > Duplicates

The SEO Spider is able to find exact duplicates where pages are identical to each other, and near duplicates where some content matches between different pages. Both of these can be viewed in the ‘Content’ tab and corresponding ‘Exact Duplicates’ and ‘Near Duplicates’ filters.

Near Duplicates

Exact duplicate pages are discovered by default. To check for ‘near duplicates’ the configuration must be enabled, so that it allows the SEO Spider to store the content of each page.

Near Duplicates

The SEO Spider will identify near duplicates with a 90% similarity match using a minhash algorithm, which can be adjusted to find content with a lower similarity threshold.

The SEO Spider will also only check ‘Indexable’ pages for duplicates (for both exact and near duplicates).

This means if you have two URLs that are the same, but one is canonicalised to the other (and therefore ‘non-indexable’), this won’t be reported – unless this option is disabled.

The near duplicates filter requires post-crawl analysis to be populated, and more detail on the duplicates can be seen in the ‘Duplicate Details’ lower tab. This displays every near duplicate URL identified, and their similarity match.

Duplicate Details Tab

Clicking on a ‘Near Duplicate Address’ in the ‘Duplicate Details’ tab will also display the near duplicate content discovered between the pages and highlight the differences.

Duplicate Content Differences

The content area used for near duplicate analysis can be adjusted via ‘Configuration > Content > Area’. You’re able to add a list of HTML elements, classes or ID’s to exclude or include for the content used.

The near duplicate content threshold and content area used in the analysis can both be updated post crawl and crawl analysis can be re-run to refine the results, without the need for re-crawling.


Spelling & grammar

Configuration > Content > Spelling & Grammar

The SEO Spider is able to perform a spelling and grammar check on HTML pages in a crawl. Other content types are currently not supported, but might be in the future.

Spelling & Grammar Checks

The spelling and grammar checks are disabled by default and need to be enabled for spelling and grammar errors to be displayed in the ‘Content’ tab, and corresponding ‘Spelling Errors’ and ‘Grammar Errors’ filters.

Enable Spelling & Grammar Checks

The spelling and grammar feature will auto identify the language used on a page (via the HTML language attribute), but also allow you to manually select language where required within the configuration.

Spelling & Grammar Check Language

It supports 39 languages, which include –

  • Arabic
  • Asturian
  • Belarusian
  • Breton
  • Catalan
  • Chinese
  • Danish
  • Dutch
  • English (Australia, Canada, New Zealand, South Africa, USA, UK)
  • French
  • Galician
  • German (Austria, Germany, Switzerland)
  • Greek
  • Italian
  • Japanese
  • Khmer
  • Persian (Afghanistan, Iran)
  • Polish
  • Portuguese (Angola, Brazil, Mozambique, Portugal)
  • Romanian
  • Russian
  • Slovak
  • Slovenian
  • Spanish
  • Swedish
  • Tagalog
  • Tamil
  • Ukrainian

Please see our FAQ if you’d like to see a new language supported for spelling and grammar.

The lower window ‘Spelling & Grammar Details’ tab shows the error, type (spelling or grammar), detail, and provides a suggestion to correct the issue. The right-hand side of the details tab also shows a visual of the text from the page and the errors identified.

The right-hand pane ‘Spelling & Grammar’ tab displays the top 100 unique errors discovered and the number of URLs it affects. This can be helpful for finding errors across templates, and for building your dictionary or ignore list. You can right click and choose to ‘Ignore grammar rule’, ‘Ignore All’, or ‘Add to Dictionary’ where relevant.

Top 100 Errors Spelling & Grammar

Spelling & Grammar Configurations

The ‘grammar rules’ configuration allows you to enable and disable specific grammar rules used. You’re able to right click and ‘Ignore grammar rule’ on specific grammar issues identified during a crawl.

The ‘Ignore’ configuration allows you to ignore a list of words for a crawl. This is only for a specific crawl, and not remembered across all crawls. You’re able to right click and ‘Ignore All’ on spelling errors discovered during a crawl.

The ‘dictionary’ allows you to ignore a list of words for every crawl performed. This list is stored against the relevant dictionary, and remembered for all crawls performed. Words can be added and removed at any time for each dictionary. You’re able to right click and ‘Add to Dictionary’ on spelling errors identified in a crawl.

The content area used for spelling and grammar can be adjusted via ‘Configuration > Content > Area’. You’re able to add a list of HTML elements, classes or ID’s to exclude or include for the content analysed.

Grammar rules, ignore words, dictionary and content area settings used in the analysis can all be updated post crawl (or when paused) and the spelling and grammar checks can be re-run to refine the results, without the need for re-crawling.

Re-run Spelling & Grammar Checker

Robots.txt

Configuration > Robots.txt

By default the SEO Spider will obey robots.txt protocol and is set to ‘Respect robots.txt’. This means the SEO Spider will not be able to crawl a site if it’s disallowed via robots.txt.

This setting can be adjusted to ‘Ignore robots.txt’, or ‘Ignore robots.txt but report status’.

Ignore robots.txt

The ‘Ignore robots.txt’ option allows you to ignore this protocol, which is down to the responsibility of the user. This option actually means the SEO Spider will not even download the robots.txt file. So it also means all robots directives will be completely ignored.

Ignore robots.txt but report status

The ‘Ignore robots.txt, but report status’ configuration means the robots.txt of websites is downloaded and reported in the SEO Spider. However, the directives within it are ignored. This allows you to crawl the website, but still see which pages should be blocked from crawling.

Show Internal URLs Blocked By Robots.txt

By default internal URLs blocked by robots.txt will be shown in the ‘Internal’ tab with Status Code of ‘0’ and Status ‘Blocked by Robots.txt’. To hide these URLs in the interface deselect this option. This option is not available if ‘Ignore robots.txt’ is checked.

You can also view internal URLs blocked by robots.txt under the ‘Response Codes’ tab and ‘Blocked by Robots.txt’ filter. This will also show the robots.txt directive (‘matched robots.txt line’ column) of the disallow against each URL that is blocked.

Show External URLs Blocked By Robots.txt

By default external URLs blocked by robots.txt are hidden. To display these in the External tab with Status Code ‘0’ and Status ‘Blocked by Robots.txt’ check this option. This option is not available if ‘Ignore robots.txt’ is checked.

You can also view external URLs blocked by robots.txt under the ‘Response Codes’ tab and ‘Blocked by Robots.txt’ filter. This will also show the robots.txt directive (‘matched robots.txt line’ column) of the disallow against each URL that is blocked.


Custom Robots

You can download, edit and test a site’s robots.txt using the custom robots.txt feature which will override the live version on the site for the crawl. It will not update the live robots.txt on the site.

This feature allows you to add multiple robots.txt at subdomain level, test directives in the SEO Spider and view URLs which are blocked or allowed. The custom robots.txt uses the selected user-agent in the configuration.
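
For example, you could test directives such as the following (illustrative paths) before adding them to the live file:

User-agent: *
Disallow: /test-directory/
Allow: /test-directory/allowed-page.html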

custom robots.txt

During a crawl you can filter blocked URLs based upon the custom robots.txt (‘Response Codes > Blocked by robots.txt’) and see the matching robots.txt directive line.

URLs blocked by robots.txt

Please read our featured user guide using the SEO Spider as a robots.txt tester.

Please note – As mentioned above, the changes you make to the robots.txt within the SEO Spider, do not impact your live robots.txt uploaded to your server. You can however copy and paste these into the live version manually to update your live directives.


URL rewriting

Configuration > URL Rewriting

The URL rewriting feature allows you to rewrite URLs on the fly. For the majority of cases, the ‘remove parameters’ and common options (under ‘options’) will suffice. However, we do also offer an advanced regex replace feature which provides further control.

URL rewriting is only applied to URLs discovered in the course of crawling a website, not URLs that are entered as the start of a crawl in ‘Spider’ mode, or as part of a set of URLs in ‘List’ mode.

Remove Parameters

This feature allows you to automatically remove parameters in URLs. This is extremely useful for websites with session IDs, Google Analytics tracking or lots of parameters which you wish to remove. For example –

If the website has session IDs which make the URLs appear something like ‘example.com/?sid=random-string-of-characters’, then to remove the session ID you just need to add ‘sid’ (without the apostrophes) within the ‘parameters’ field in the ‘remove parameters’ tab.

Remove parameters, like session IDs

The SEO Spider will then automatically strip the session ID from the URL. You can test to see how a URL will be rewritten by our SEO Spider under the ‘test’ tab.

url rewriting test

This feature can also be used for removing Google Analytics tracking parameters. For example, you can just include the following under ‘remove parameters’ –

utm_source
utm_medium
utm_campaign

This will strip the standard tracking parameters from URLs.
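
For example (illustrative URL), a URL such as:

https://www.example.com/page?utm_source=newsletter&utm_medium=email&utm_campaign=spring

would be rewritten to:

https://www.example.com/page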

Regex Replace

This advanced feature runs against each URL found during a crawl or in list mode. It replaces each substring of a URL that matches the regex with the given replace string. The “Regex Replace” feature can be tested in the “Test” tab of the “URL Rewriting” configuration window.

url rewriting HTTP to HTTPS

Examples are:

1) Changing all links from HTTP to HTTPS

Regex: http
Replace: https

2) Changing all links to example.com to be example.co.uk

Regex: .com
Replace: .co.uk

3) Changing all links containing page=number to a fixed number, e.g.

www.example.com/page.php?page=1
www.example.com/page.php?page=2
www.example.com/page.php?page=3
www.example.com/page.php?page=4

To make all these go to www.example.com/page.php?page=1

Regex: page=\d+
Replace: page=1

4) Removing the www. domain from any URL by using an empty ‘Replace’. If you want to remove a query string parameter, please use the “Remove Parameters” feature – Regex is not the correct tool for this job!

Regex: www.
Replace:

5) Stripping all parameters

Regex: \?.*
Replace:

6) Changing links for only subdomains of example.com from HTTP to HTTPS

Regex: http://(.*example.com)
Replace: https://$1

7) Removing anything after the hash value in JavaScript rendering mode

Regex: #.*
Replace:

8) Adding parameters to URLs

Regex: $
Replace: ?parameter=value

This will add ‘?parameter=value’ to the end of any URL encountered.

In situations where the site already has parameters this requires more complicated expressions for the parameter to be added correctly:

Regex: (.*?\?.*)
Replace: $1&parameter=value

Regex: (^((?!\?).)*$)
Replace: $1?parameter=value

These must be entered in the order above or this will not work when adding the new parameter to existing query strings.

Options

We will include common options under this section. The ‘lowercase discovered URLs’ option does exactly that: it converts all URLs crawled into lowercase, which can be useful for websites with case sensitivity issues in URLs.


CDNs

Configuration > CDNs

The CDNs feature allows you to enter a list of CDNs to be treated as ‘Internal’ during the crawl.

You’re able to supply a list of domains to be treated as internal. You can also supply a subfolder with the domain, for the subfolder (and contents within) to be treated as internal.
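
For example, entries might look like the following (illustrative domains and subfolder):

cdn.example.com
assets.example.com/images/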

‘Internal’ links are then included in the ‘Internal’ tab, rather than the ‘External’ tab, and more details are extracted from them.


Include

Configuration > Include

This feature allows you to control which URL path the SEO Spider will crawl using partial regex matching. It narrows the default search by only crawling the URLs that match the regex which is particularly useful for larger sites, or sites with less intuitive URL structures. Matching is performed on the encoded version of the URL.

The page that you start the crawl from must have an outbound link which matches the regex for this feature to work, or it just won’t crawl onwards. If there is not a URL which matches the regex from the start page, the SEO Spider will not crawl anything!

  • As an example, if you wanted to crawl pages from https://www.screamingfrog.co.uk which have ‘search’ in the URL string you would simply include the regex: search in the ‘include’ feature. This would find the /search-engine-marketing/ and /search-engine-optimisation/ pages as they both have ‘search’ in them.

Check out our video guide on the include feature.

Troubleshooting

  • Matching is performed on the URL encoded address, you can see what this is in the URL Info tab in the lower window pane or respective column in the Internal tab.
  • The regular expression must match the whole URL, not just part of it.
  • If you experience just a single URL being crawled and then the crawl stopping, check your outbound links from that page. If you crawl http://www.example.com/ with an include of ‘/news/’ and only 1 URL is crawled, then it will be because http://www.example.com/ does not have any links to the news section of the site.

Exclude

Configuration > Exclude

The exclude configuration allows you to exclude URLs from a crawl by using partial regex matching. A URL that matches an exclude is not crawled at all (it’s not just ‘hidden’ in the interface). This will mean other URLs that do not match the exclude, but can only be reached from an excluded page will also not be found in the crawl.

The exclude list is applied to new URLs that are discovered during the crawl. This exclude list does not get applied to the initial URL(s) supplied in crawl or list mode.

Changing the exclude list during a crawl will affect newly discovered URLs and it will be applied retrospectively to the list of pending URLs, but not update those already crawled.

Matching is performed on the URL encoded version of the URL. You can see the encoded version of a URL by selecting it in the main window then in the lower window pane in the details tab looking at the ‘URL Details’ tab, and the value second row labelled “URL Encoded Address”.

Here are some common examples –

  • To exclude a specific URL or page the syntax is:
    http://www.example.com/do-not-crawl-this-page.html
  • To exclude a sub directory or folder the syntax is:
    http://www.example.com/do-not-crawl-this-folder/
  • To exclude everything after brand where there can sometimes be other folders before:
    http://www.example.com/.*/brand.*
  • If you wish to exclude URLs with a certain parameter such as ‘?price’ contained in a variety of different directories you can simply use (Note the ? is a special character in regex and must be escaped with a backslash):
    \?price
  • To exclude anything with a question mark ‘?’(Note the ? is a special character in regex and must be escaped with a backslash):
    \?
  • If you wanted to exclude all files ending jpg, the regex would be:
    jpg$
  • If you wanted to exclude all URLs with 1 or more digits in a folder such as ‘/1/’ or ‘/999/’:
    /\d+/$
  • If you wanted to exclude all URLs ending with a random 6 digit number after a hyphen such as ‘-402001’, the regex would be:
    -[0-9]{6}$
  • If you wanted to exclude any URL with ‘exclude’ within them, the regex would be:
    exclude
  • Secure (https) pages would be:
    https
  • Excluding all pages on http://www.domain.com would be:
    http://www.domain.com/
  • If you want to exclude a URL and it doesn’t seem to be working, it’s probably because it contains special regex characters such as ‘?’. Rather than trying to locate and escape these individually, you can escape the whole line starting with \Q and ending with \E as follows:
    \Qhttp://www.example.com/test.php?product=special\E
  • Remember to use the encoded version of the URL. So if you wanted to exclude any URLs with a pipe |, it would be:
    %7C
  • If you’re extracting cookies, which removes the auto exclude for Google Analytics tracking tags, you could stop them from firing by including:
    google-analytics.com

Check out our video guide on the exclude feature.


Speed

Configuration > Speed

The speed configuration allows you to control the speed of the SEO Spider, either by number of concurrent threads, or by URLs requested per second.

When reducing speed, it’s always easier to control by the ‘Max URI/s’ option, which is the maximum number of URL requests per second. For example, the screenshot below would mean crawling at 1 URL per second –

SEO Spider Configuration

The ‘Max Threads’ option can simply be left alone when you throttle speed via URLs per second.

Increasing the number of threads allows you to significantly increase the speed of the SEO Spider. By default the SEO Spider crawls at 5 threads, to not overload servers.

Please use the threads configuration responsibly, as setting the number of threads high to increase the speed of the crawl will increase the number of HTTP requests made to the server and can impact a site’s response times. In very extreme cases, you could overload a server and crash it.

We recommend approving a crawl rate and time with the webmaster first, monitoring response times and adjusting the default speed if there are any issues.


User agent

Configuration > User-Agent

The user-agent configuration allows you to switch the user-agent of the HTTP requests made by the SEO Spider. By default the SEO Spider makes requests using its own ‘Screaming Frog SEO Spider’ user-agent string.

However, it has inbuilt preset user agents for Googlebot, Bingbot, various browsers and more. This allows you to switch between them quickly when required. This feature also has a custom user-agent setting which allows you to specify your own user agent.

Details on how the SEO Spider handles robots.txt can be found here.


HTTP header

Configuration > HTTP Header

The HTTP Header configuration allows you to supply completely custom header requests during a crawl.

Custom HTTP Headers

This means you’re able to set anything from accept-language, cookie, referer, or just supplying any unique header name. For example, there are scenarios where you may wish to supply an Accept-Language HTTP header in the SEO Spider’s request to crawl locale-adaptive content.

You can choose to supply any language and region pair that you require within the header value field.
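
For example, to request German content from a locale-adaptive site, the header value might look like the following (illustrative value):

Accept-Language: de-DE, de;q=0.9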

User-agent is configured separately from other headers via ‘Configuration > User-Agent’.


Custom extraction

Configuration > Custom > Extraction

Custom extraction allows you to collect any data from the HTML of a URL. Extraction is performed on the static HTML returned by internal HTML pages with a 2XX response code. You can switch to JavaScript rendering mode to extract data from the rendered HTML (for any data that’s client-side only).

The SEO Spider supports the following modes to perform data extraction:

  • XPath: XPath selectors, including attributes.
  • CSS Path: CSS Path and optional attribute.
  • Regex: For more advanced uses, such as scraping HTML comments or inline JavaScript.

When using XPath or CSS Path to collect HTML, you can choose what to extract:

  • Extract HTML Element: The selected element and its inner HTML content.
  • Extract Inner HTML: The inner HTML content of the selected element. If the selected element contains other HTML elements, they will be included.
  • Extract Text: The text content of the selected element and the text content of any sub elements.
  • Function Value: The result of the supplied function, eg count(//h1) to find the number of h1 tags on a page.

To set up custom extraction, click ‘Config > Custom > Custom Extraction’.

Custom Extraction

Just click ‘Add’ to start setting up an extractor.

web scraping custom extractor

Then insert the relevant expression to scrape data. Up to 100 separate extractors can be configured to scrape data from a website with a limit of up to 1,000 extractions across all extractors.
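
For example, expressions might look like the following (illustrative element names and classes, which will vary by site):

XPath: //div[@class="author"]//a
CSSPath: .product-price
Regex: "datePublished":\s*"(.*?)"

These would extract an author link, a product price element, and the published date from embedded JSON-LD respectively.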

Web scraping with custom extraction

If you’re unfamiliar with XPath, CSSPath and regex, you can use the visual custom extraction feature to select elements to scrape using an inbuilt browser. Click on the ‘browser’ icon next to the extractor.

Launch Visual Custom Extraction

Enter a URL you wish to extract data from in the URL bar and select the element you wish to scrape.

Scraping an author name

The SEO Spider will then highlight the area on the page, and create a variety of suggested expressions, and a preview of what will be extracted based upon the raw or rendered HTML. In this case, an author name from a blog post.

The data extracted can be viewed in the Custom Extraction tab. Extracted data is also included as columns within the ‘Internal’ tab.

web scraping results

Please read our SEO Spider web scraping guide for a full tutorial on how to use custom extraction. For examples of custom extraction expressions, please see our XPath Examples and Regex Examples.

Regex Troubleshooting

  • The SEO Spider does not pre-process HTML before running regexes. Please bear in mind however that the HTML you see in a browser when viewing source may be different to what the SEO Spider sees. This can be caused by the website returning different content based on User-Agent or Cookies, or if the page’s content is generated using JavaScript and you are not using JavaScript rendering.
  • More details on the regex engine used by the SEO Spider can be found here.
  • The regex engine is configured such that the dot character matches newlines.
  • Regular Expressions, depending on how they are crafted, and the HTML they are run against, can be slow. This will have the effect of slowing the crawl down.

Google Analytics integration

Configuration > API Access > Google Universal Analytics / Google Analytics 4

You can connect to the Google Universal Analytics API and GA4 API and pull in data directly during a crawl. The SEO Spider can fetch user and session metrics, as well as goal conversions and ecommerce (transactions and revenue) data for landing pages, so you can view your top performing pages when performing a technical or content audit.

To set this up, start the SEO Spider and go to ‘Configuration > API Access’ and choose ‘Google Universal Analytics’ or ‘Google Analytics 4’.

Next, connect to a Google account (which has access to the Analytics account you wish to query) by granting the ‘Screaming Frog SEO Spider’ app permission to access your account to retrieve the data.

Google APIs use the OAuth 2.0 protocol for authentication and authorisation. The SEO Spider will remember any Google accounts you authorise within the list, so you can ‘connect’ quickly upon starting the application each time.

GA4 Login

Once connected in Universal Analytics, you can choose the relevant Google Analytics account, property, view, segment and date range.

Universal Analytics Account, Property and View selection

For GA4, you can select the analytics account, property and Data Stream.

GA4 Data Streams

Then simply select the metrics that you wish to fetch for Universal Analytics –

Universal Analytics Metrics

Or for GA4 –

GA4 Metrics

By default the SEO Spider collects the following 11 metrics in Universal Analytics –

  1. Sessions
  2. % New Sessions
  3. New Users
  4. Users
  5. Bounce Rate
  6. Page Views Per Session
  7. Avg Session Duration
  8. Page Value
  9. Goal Conversion Rate
  10. Goal Completions All
  11. Goal Value All

For UA you can select up to 30 metrics at a time from their API.

By default the SEO Spider collects the following 7 metrics in GA4 –

  1. Sessions
  2. Engaged Sessions
  3. Engagement Rate
  4. Views
  5. Conversions
  6. Event Count
  7. Total Revenue

For GA4 you can select up to 65 metrics available via their API.

You can read more about the metrics available and the definition of each metric from Google for Universal Analytics and GA4.

You can also set the dimension of each individual metric against either full page URL (‘Page Path’ in UA), or landing page, which are quite different (and both useful depending on your scenario and objectives).

Google Analytics Dimensions

For GA4 there is also a ‘filters’ tab, which allows you to select additional dimensions. For example, you can choose first user or session channel grouping with dimension values, such as ‘organic search’ to refine to a specific channel.

GA4 Filters

There are scenarios where URLs in Google Analytics might not match URLs in a crawl, so these are covered by auto matching trailing and non-trailing slash URLs and case sensitivity (upper and lowercase characters in URLs). Google doesn’t pass the protocol (HTTP or HTTPS) via their API, so these are also matched automatically.

Google Analytics General Config

When selecting either of the above options, please note that data from Google Analytics is sorted by sessions, so matching is performed against the URL with the highest number of sessions. Data is not aggregated for those URLs.

The following options are available –

  • Match Trailing and Non-Trailing Slash URLs – Allows both http://example.com/contact and http://example.com/contact/ to match either http://example.com/contact or http://example.com/contact/ from GA, whichever has the highest number of sessions.
  • Match Uppercase & Lowercase URLs – Allows http://example.com/contact.html, http://example.com/Contact.html and http://example.com/CONTACT.html to match the version of this URL from GA with the highest number of sessions.
  • Limit Max Results – If you have hundreds of thousands of URLs in GA, you can choose to limit the number of URLs to query, which is by default ordered by sessions to return the top performing page data of the top 100,000 URLs.
  • Crawl New URLs Discovered in Google Analytics – This means any new URLs discovered in Google Analytics (that are not found via hyperlinks) will be crawled. If this option isn’t enabled, then new URLs discovered via Google Analytics will only be available to view in the ‘Orphan Pages’ report. They won’t be added to the crawl queue, be viewable within the user interface, or appear under the respective tabs and filters. Please see our guide on finding orphan pages.
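To make the matching options above more concrete, a rough sketch of this kind of URL normalisation (an illustration only, not the SEO Spider’s internal code) might look like this:

def matching_keys(url: str) -> set[str]:
    # Strip the protocol, since GA data is matched regardless of HTTP/HTTPS.
    bare = url.split("://", 1)[-1]
    # Cover both the trailing and non-trailing slash variants.
    variants = {bare, bare.rstrip("/") if bare.endswith("/") else bare + "/"}
    # Add lowercase versions to cover the uppercase and lowercase matching option.
    return variants | {v.lower() for v in variants}

print(matching_keys("http://example.com/Contact"))
# Four variants (set order may vary):
# {'example.com/Contact', 'example.com/Contact/', 'example.com/contact', 'example.com/contact/'}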

Google Analytics data will be fetched and displayed in the respective columns within the ‘Internal’ and ‘Analytics’ tabs.

There’s an ‘API’ progress bar in the top right and when this has reached 100%, analytics data will start appearing against URLs in real-time. The more URLs and metrics queried the longer this process can take, but generally it’s extremely quick.

Google Analytics Data populating in the SEO Spider

There are 5 filters currently under the ‘Analytics’ tab, which allow you to filter the Google Analytics data –

  • Sessions Above 0 – This simply means the URL in question has 1 or more sessions.
  • Bounce Rate Above 70% – This means the URL has a bounce rate over 70%, which you may wish to investigate. In some scenarios this is normal though!
  • No GA Data – This means that for the metrics and dimensions queried, the Google API didn’t return any data for the URLs in the crawl. So the URLs either didn’t receive any sessions, or perhaps the URLs in the crawl are just different to those in GA for some reason.
  • Non-Indexable with GA Data – This means the URL is non-indexable, but still has data from GA.
  • Orphan URLs – This means the URL was only discovered via GA, and was not found via an internal link during the crawl.

Please read the following FAQs for various issues with accessing Google Analytics data in the SEO Spider –

  1. Why do I receive an error when granting access to my Google account?
  2. Why does my connection to Google Analytics fail?
  3. Why doesn’t GA data populate against my URLs?
  4. Why doesn’t the GA API data in the SEO Spider match what’s reported in the GA interface?
  5. Why can’t I see GA4 properties when I connect my Google Analytics account?

Please note, Google APIs use the OAuth 2.0 protocol for authentication and authorisation, and the data provided via Google Analytics and other APIs is only accessible locally on your machine. We cannot view and do not store that data ourselves. Please see more in our FAQ.

Using the Google Analytics 4 API is subject to their standard property quotas for core tokens.


Google Search Console integration

Configuration > API Access > Google Search Console

You can connect to the Google Search Analytics and URL Inspection APIs and pull in data directly during a crawl.

By default the SEO Spider will fetch impressions, clicks, CTR and position metrics from the Search Analytics API, so you can view your top performing pages when performing a technical or content audit.

Optionally, you can also choose to ‘Enable URL Inspection’ alongside Search Analytics data, which provides Google index status data for up to 2,000 URLs per property a day. This includes whether the ‘URL is on Google’, or ‘URL is not on Google’ and coverage.

Search Console data

To set this up, go to ‘Configuration > API Access > Google Search Console’. Connecting to Google Search Console works in the same way as already detailed in our step-by-step Google Analytics integration guide.

Connect to a Google account (which has access to the Search Console account you wish to query) by granting the ‘Screaming Frog SEO Spider’ app permission to access your account to retrieve the data. Google APIs use the OAuth 2.0 protocol for authentication and authorisation. The SEO Spider will remember any Google accounts you authorise within the list, so you can ‘connect’ quickly upon starting the application each time.

Once you have connected, you can choose the relevant website property.

Google Search Console integration

By default the SEO Spider collects the following metrics for the last 30 days –

  • Clicks
  • Impressions
  • CTR
  • Position

Read more about the definition of each metric from Google.
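For reference, these metrics come from the Search Analytics query endpoint; a minimal sketch of a comparable request (an illustration only, with a placeholder property and access token) could look like this:

import datetime as dt
import urllib.parse

import requests

# Placeholder verified property and OAuth access token.
SITE_URL = "https://www.example.com/"
ACCESS_TOKEN = "ya29.example-access-token"

end = dt.date.today()
start = end - dt.timedelta(days=30)

response = requests.post(
    "https://www.googleapis.com/webmasters/v3/sites/"
    f"{urllib.parse.quote(SITE_URL, safe='')}/searchAnalytics/query",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={
        "startDate": start.isoformat(),
        "endDate": end.isoformat(),
        # Group by page, so each row maps onto a URL.
        "dimensions": ["page"],
        "rowLimit": 1000,
    },
    timeout=30,
)
response.raise_for_status()
for row in response.json().get("rows", []):
    print(row["keys"][0], row["clicks"], row["impressions"], row["ctr"], row["position"])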

If you click the ‘Search Analytics’ tab in the configuration, you can adjust the date range, dimensions and various other settings.

Google Search Console Search Analytics configuration

If you wish to crawl new URLs discovered from Google Search Console to find any potential orphan pages, remember to enable the configuration shown below.

Orphan urls in search console

Optionally, you can navigate to the ‘URL Inspection’ tab and ‘Enable URL Inspection’ to collect data about the indexed status of up to 2,000 URLs in the crawl.

Google Search Console URL Inspection API integration

The SEO Spider crawls breadth-first by default, meaning via crawl depth from the start page of the crawl. The first 2k HTML URLs discovered will be queried, so to get data on the key URLs and templates you need, focus the crawl on specific sections, use the include and exclude configuration, or use list mode.

The following configuration options are available –

  • Ignore Non-Indexable URLs for URL Inspection – This means any URLs in the crawl that are classed as ‘Non-Indexable’ won’t be queried via the API. Only Indexable URLs will be queried, which can help save on your inspection quota if you’re confident in your site’s set-up.
  • Use Multiple Properties – If multiple properties are verified for the same domain the SEO Spider will automatically detect all relevant properties in the account, and use the most specific property to request data for the URL. This means it’s now possible to get far more than 2k URLs with URL Inspection API data in a single crawl, if there are multiple properties set up – without having to perform multiple crawls.
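For reference, the underlying URL Inspection API call is a simple POST per URL; a minimal sketch of a comparable request (an illustration, not the SEO Spider’s implementation, with placeholder values) might look like this:

import requests

# Placeholder verified property and OAuth access token with the Search Console scope.
SITE_URL = "https://www.example.com/"
ACCESS_TOKEN = "ya29.example-access-token"

response = requests.post(
    "https://searchconsole.googleapis.com/v1/urlInspection/index:inspect",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={
        "inspectionUrl": "https://www.example.com/some-page/",
        "siteUrl": SITE_URL,
    },
    timeout=30,
)
response.raise_for_status()
result = response.json()["inspectionResult"]["indexStatusResult"]
# Fields broadly corresponding to the Summary, Coverage and canonical columns described below.
print(result.get("verdict"), result.get("coverageState"), result.get("googleCanonical"))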

The URL Inspection API includes the following data.

  • Summary – A top level verdict on whether the URL is indexed and eligible to display in the Google search results. ‘URL is on Google’ means the URL has been indexed, can appear in Google Search results, and no problems were found with any enhancements found in the page (rich results, mobile, AMP). ‘URL is on Google, but has Issues’ means it has been indexed and can appear in Google Search results, but there are some problems with mobile usability, AMP or Rich results that might mean it doesn’t appear in an optimal way. ‘URL is not on Google’ means it is not indexed by Google and won’t appear in the search results. This filter can include non-indexable URLs (such as those that are ‘noindex’) as well as Indexable URLs that are able to be indexed.
  • Coverage – A short, descriptive reason for the status of the URL, explaining why the URL is or isn’t on Google.
  • Last Crawl – The last time this page was crawled by Google, in your local time. All information shown in this tool is derived from this last crawled version.
  • Crawled As – The user agent type used for the crawl (desktop or mobile).
  • Crawl Allowed – Indicates whether your site allowed Google to crawl (visit) the page or blocked it with a robots.txt rule.
  • Page Fetch – Whether or not Google could actually get the page from your server. If crawling is not allowed, this field will show a failure.
  • Indexing Allowed – Whether or not your page explicitly disallowed indexing. If indexing is disallowed, the reason is explained, and the page won’t appear in Google Search results.
  • User-Declared Canonical – If your page explicitly declares a canonical URL, it will be shown here.
  • Google-Selected Canonical – The page that Google selected as the canonical (authoritative) URL, when it found similar or duplicate pages on your site.
  • Mobile Usability – Whether the page is mobile friendly or not.
  • Mobile Usability Issues – If the ‘page is not mobile friendly’, this column will display a list of mobile usability errors.
  • AMP Results – A verdict on whether the AMP URL is valid, invalid or has warnings. ‘Valid’ means the AMP URL is valid and indexed. ‘Invalid’ means the AMP URL has an error that will prevent it from being indexed. ‘Valid with warnings’ means the AMP URL can be indexed, but there are some issues that might prevent it from getting full features, or it uses tags or attributes that are deprecated, and might become invalid in the future.
  • AMP Issues – If the URL has AMP issues, this column will display a list of AMP errors.
  • Rich Results – A verdict on whether Rich results found on the page are valid, invalid or has warnings. ‘Valid’ means rich results have been found and are eligible for search. ‘Invalid’ means one or more rich results on the page has an error that will prevent it from being eligible for search. ‘Valid with warnings’ means the rich results on the page are eligible for search, but there are some issues that might prevent it from getting full features.
  • Rich Results Types – A comma separated list of all rich result enhancements discovered on the page.
  • Rich Results Types Errors – A comma separated list of all rich result enhancements discovered with an error on the page. To export specific errors discovered, use the ‘Bulk Export > URL Inspection > Rich Results’ export.
  • Rich Results Warnings – A comma separated list of all rich result enhancements discovered with a warning on the page. To export specific warnings discovered, use the ‘Bulk Export > URL Inspection > Rich Results’ export.

You can read more about the indexed URL results from Google.

There are 11 filters under the ‘Search Console’ tab, which allow you to filter Google Search Console data from both APIs.

  • Clicks Above 0 – This simply means the URL in question has 1 or more clicks.
  • No Search Analytics Data – This means that the Search Analytics API didn’t return any data for the URLs in the crawl. So the URLs either didn’t receive any impressions, or perhaps the URLs in the crawl are just different to those in GSC for some reason.
  • Non-Indexable with Search Analytics Data – URLs that are classed as non-indexable, but have Google Search Analytics data.
  • Orphan URLs – URLs that have been discovered via Google Search Analytics, rather than internal links during a crawl. This filter requires ‘Crawl New URLs Discovered In Google Search Console’ to be enabled under the ‘General’ tab of the Google Search Console configuration window (Configuration > API Access > Google Search Console), and post ‘crawl analysis’ to be performed for it to be populated. Please see our guide on how to find orphan pages.
  • URL Is Not on Google – The URL is not indexed by Google and won’t appear in the search results. This filter can include non-indexable URLs (such as those that are ‘noindex’) as well as Indexable URLs that are able to be indexed. It’s a catch all filter for anything not on Google according to the API.
  • Indexable URL Not Indexed – Indexable URLs found in the crawl that are not indexed by Google and won’t appear in the search results. This can include URLs that are unknown to Google, or those that have been discovered but not indexed, and more.
  • URL is on Google, But Has Issues – The URL has been indexed and can appear in Google Search results, but there are some problems with mobile usability, AMP or Rich results that might mean it doesn’t appear in an optimal way.
  • User-Declared Canonical Not Selected – Google has chosen to index a different URL to the one declared by the user in the HTML. Canonicals are hints, and sometimes Google does a great job of this, other times it’s less than ideal.
  • Page Is Not Mobile Friendly – The page has issues on mobile devices.
  • AMP URL Is Invalid – The AMP has an error that will prevent it from being indexed.
  • Rich Result Invalid – The URL has an error with one or more rich result enhancements that will prevent the rich result from showing in the Google search results. To export specific errors discovered, use the ‘Bulk Export > URL Inspection > Rich Results’ export.

Please see our tutorial on ‘How To Automate The URL Inspection API‘.


PageSpeed Insights integration

Configuration > API Access > PageSpeed Insights

You can connect to the Google PageSpeed Insights API and pull in data directly during a crawl.

PageSpeed Insights uses Lighthouse, so the SEO Spider is able to display Lighthouse speed metrics, analyse speed opportunities and diagnostics at scale and gather real-world data from the Chrome User Experience Report (CrUX) which contains Core Web Vitals from real-user monitoring (RUM).

To set this up, start the SEO Spider and go to ‘Configuration > API Access > PageSpeed Insights’, enter a free PageSpeed Insights API key, choose your metrics, connect and crawl.

Setting Up A PageSpeed Insights API Key

To set-up a free PageSpeed Insights API key, login to your Google account and then visit the PageSpeed Insights getting started page.

Once you’re on the page, scroll down a paragraph and click on the ‘Get a Key’ button.

PSI API Key

Then follow the process of creating a key – by submitting a project name, agreeing to the terms and conditions and clicking ‘next’.

PSI API Key Step 1

It will then enable the key for PSI and provide an API key which can be copied.

PSI API Key Step 2

Copy the key, and click ‘Done’.

Then simply paste this in the SEO Spider ‘Secret Key:’ field under ‘Configuration > API Access > PageSpeed Insights’ and press ‘connect’. This key is used when making calls to the API at https://www.googleapis.com/pagespeedonline/v5/runPagespeed.
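If you’re curious what that call looks like, here is a minimal sketch of a direct request to the same endpoint (the API key value is a placeholder):

import requests

API_KEY = "YOUR_PSI_API_KEY"  # placeholder - use your own key

response = requests.get(
    "https://www.googleapis.com/pagespeedonline/v5/runPagespeed",
    params={
        "url": "https://www.example.com/",
        "key": API_KEY,
        "strategy": "mobile",       # or 'desktop', matching the device configuration
        "category": "performance",
    },
    timeout=120,
)
response.raise_for_status()
data = response.json()
# Lighthouse 'Lab Data' performance score (0-1), plus CrUX 'Field Data' where available.
print(data["lighthouseResult"]["categories"]["performance"]["score"])
print(data.get("loadingExperience", {}).get("overall_category"))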

PageSpeed Insights API Key Connection

That’s it, you’re now connected! The SEO Spider will remember your secret key, so you can ‘connect’ quickly upon starting the application each time.

If you find that your API key is saying it’s ‘failed to connect’, it can take a couple of minutes to activate. You can also check that the PSI API has been enabled in the API library as per our FAQ. If it isn’t enabled, enable it – and it should then allow you to connect.

Once you have connected, you can choose metrics and device to query under the ‘metrics’ tab.

PageSpeed metrics configuration

The following speed metrics, opportunities and diagnostics data can be configured to be collected via the PageSpeed Insights API integration.

Overview Metrics

  • Total Size Savings
  • Total Time Savings
  • Total Requests
  • Total Page Size
  • HTML Size
  • HTML Count
  • Image Size
  • Image Count
  • CSS Size
  • CSS Count
  • JavaScript Size
  • JavaScript Count
  • Font Size
  • Font Count
  • Media Size
  • Media Count
  • Other Size
  • Other Count
  • Third Party Size
  • Third Party Count

CrUX Metrics (‘Field Data’ in PageSpeed Insights)

  • Core Web Vitals Assessment
  • CrUX First Contentful Paint Time (sec)
  • CrUX First Contentful Paint Category
  • CrUX First Input Delay Time (sec)
  • CrUX First Input Delay Category
  • CrUX Largest Contentful Paint Time (sec)
  • CrUX Largest Contentful Paint Category
  • CrUX Cumulative Layout Shift
  • CrUX Cumulative Layout Shift Category
  • CrUX Interaction to Next Paint (ms)
  • CrUX Interaction to Next Paint Category
  • CrUX Time to First Byte (ms)
  • CrUX Time to First Byte Category
  • CrUX Origin Core Web Vitals Assessment
  • CrUX Origin First Contentful Paint Time (sec)
  • CrUX Origin First Contentful Paint Category
  • CrUX Origin First Input Delay Time (sec)
  • CrUX Origin First Input Delay Category
  • CrUX Origin Largest Contentful Paint Time (sec)
  • CrUX Origin Largest Contentful Paint Category
  • CrUX Origin Cumulative Layout Shift
  • CrUX Origin Cumulative Layout Shift Category
  • CrUX Origin Interaction to Next Paint (ms)
  • CrUX Origin Interaction to Next Paint Category
  • CrUX Origin Time to First Byte (ms)
  • CrUX Origin Time to First Byte Category

Lighthouse Metrics (‘Lab Data’ in PageSpeed Insights)

  • Performance Score
  • Time to First Byte (ms)
  • First Contentful Paint Time (sec)
  • First Contentful Paint Score
  • Speed Index Time (sec)
  • Speed Index Score
  • Largest Contentful Paint Time (sec)
  • Largest Contentful Paint Score
  • Time to Interactive (sec)
  • Time to Interactive Score
  • First Meaningful Paint Time (sec)
  • First Meaningful Paint Score
  • Max Potential First Input Delay (ms)
  • Max Potential First Input Delay Score
  • Total Blocking Time (ms)
  • Total Blocking Time Score
  • Cumulative Layout Shift
  • Cumulative Layout Shift Score

Opportunities

  • Eliminate Render-Blocking Resources Savings (ms)
  • Defer Offscreen Images Savings (ms)
  • Defer Offscreen Images Savings
  • Efficiently Encode Images Savings (ms)
  • Efficiently Encode Images Savings
  • Properly Size Images Savings (ms)
  • Properly Size Images Savings
  • Minify CSS Savings (ms)
  • Minify CSS Savings
  • Minify JavaScript Savings (ms)
  • Minify JavaScript Savings
  • Reduce Unused CSS Savings (ms)
  • Reduce Unused CSS Savings
  • Reduce Unused JavaScript Savings (ms)
  • Reduce Unused JavaScript Savings
  • Serve Images in Next-Gen Formats Savings (ms)
  • Serve Images in Next-Gen Formats Savings
  • Enable Text Compression Savings (ms)
  • Enable Text Compression Savings
  • Preconnect to Required Origin Savings
  • Server Response Times (TTFB) (ms)
  • Server Response Times (TTFB) Category (ms)
  • Multiple Redirects Savings (ms)
  • Preload Key Requests Savings (ms)
  • Use Video Format for Animated Images Savings (ms)
  • Use Video Format for Animated Images Savings
  • Total Image Optimization Savings (ms)
  • Avoid Serving Legacy JavaScript to Modern Browser Savings

Diagnostics

  • DOM Element Count
  • JavaScript Execution Time (sec)
  • JavaScript Execution Time Category
  • Efficient Cache Policy Savings
  • Minimize Main-Thread Work (sec)
  • Minimize Main-Thread Work Category
  • Text Remains Visible During Webfont Load
  • Image Elements Do Not Have Explicit Width & Height
  • Avoid Large Layout Shifts

You can read more about the definition of each metric, opportunity or diagnostic according to Lighthouse.

Filter by –

  • Eliminate Render-Blocking Resources – This highlights all pages with resources that are blocking the first paint of the page, along with the potential savings.
  • Properly Size Images – This highlights all pages with images that are not properly sized, along with the potential savings when they are resized appropriately.
  • Defer Offscreen Images – This highlights all pages with images that are hidden or offscreen, along with the potential savings if they were lazy-loaded.
  • Minify CSS – This highlights all pages with unminified CSS files, along with the potential savings when they are correctly minified.
  • Minify JavaScript – This highlights all pages with unminified JavaScript files, along with the potential savings when they are correctly minified.
  • Remove Unused CSS – This highlights all pages with unused CSS, along with the potential savings when they are removed of unnecessary bytes.
  • Remove Unused JavaScript – This highlights all pages with unused JavaScript, along with the potential savings when they are removed of unnecessary bytes.
  • Efficiently Encode Images – This highlights all pages with unoptimised images, along with the potential savings.
  • Serve Images in Next-Gen Formats – This highlights all pages with images that are in older image formats, along with the potential savings.
  • Enable Text Compression – This highlights all pages with text based resources that are not compressed, along with the potential savings.
  • Preconnect to Required Origin – This highlights all pages with key requests that aren’t yet prioritizing fetch requests with link rel=preconnect, along with the potential savings.
  • Reduce Server Response Times (TTFB) – This highlights all pages where the browser has had to wait for over 600ms for the server to respond to the main document request.
  • Avoid Multiple Redirects – This highlights all pages which have resources that redirect, and the potential saving by using the direct URL.
  • Preload Key Requests – This highlights all pages with resources that are third level of requests in your critical request chain as preload candidates.
  • Use Video Format for Animated Images – This highlights all pages with animated GIFs, along with the potential savings of converting them into videos.
  • Avoid Excessive DOM Size – This highlights all pages with a large DOM size over the recommended 1,500 total nodes.
  • Reduce JavaScript Execution Time – This highlights all pages with average or slow JavaScript execution time.
  • Serve Static Assets With An Efficient Cache Policy – This highlights all pages with resources that are not cached, along with the potential savings.
  • Minimize Main-Thread Work – This highlights all pages with average or slow execution timing on the main thread.
  • Ensure Text Remains Visible During Webfont Load – This highlights all pages with fonts that may flash or become invisible during page load.
  • Image Elements Do Not Have Explicit Width & Height – This highlights all pages that have images without dimensions (width and height size attributes) specified in the HTML. This can be a big cause of poor CLS.
  • Avoid Large Layout Shifts – This highlights all pages that have DOM elements contributing most to the CLS of the page and provides a contribution score of each to help prioritise.
  • Avoid Serving Legacy JavaScript to Modern Browsers – This highlights all pages with legacy JavaScript. Polyfills and transforms enable legacy browsers to use new JavaScript features. However, many aren’t necessary for modern browsers. For your bundled JavaScript, adopt a modern script deployment strategy using module/nomodule feature detection to reduce the amount of code shipped to modern browsers, while retaining support for legacy browsers.
  • Request Errors – This highlights any URLs which returned an error or redirect response from the PageSpeed Insights API.

Please read the Lighthouse performance audits guide for more definitions and explanations of each of the opportunities and diagnostics described above.

The speed opportunities, source pages and resource URLs that have potential savings can be exported in bulk via the ‘Reports > PageSpeed’ menu.

PageSpeed reporting

PageSpeed Insights API Limits

The API is limited to 25,000 queries a day at 60 queries per 100 seconds per user. The SEO Spider automatically controls the rate of requests to remain within these limits. With these limits in place, the SEO Spider can request at most 36 URLs a minute, so a crawl of 10,000 URLs would take just over 4.5 hours.
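As a quick check of that arithmetic (a worked example, assuming the quota figures above):

queries_per_100_seconds = 60
urls_per_minute = queries_per_100_seconds / 100 * 60    # 36 URLs a minute
hours_for_10k_urls = 10_000 / urls_per_minute / 60      # roughly 4.6 hours
print(urls_per_minute, round(hours_for_10k_urls, 1))    # 36.0 4.6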

Please consult the ‘quotas’ section of the API dashboard to view your API usage quota.

PageSpeed Insights API Errors

The PSI Status column shows whether an API request for a URL has been a success, or there has been an error. An ‘error’ usually mirrors the web interface, where you would see the same error and message.

PSI Unable to process request

The two most common error messages are –

  • “500: Unable to process request. Please wait a while and try again” – This error is generally replicable in the web interface and our testing suggests that from time to time the PSI API is unable to process requests, possibly due to overall load capacity. If this occurs, we recommend pausing the crawl for 10 minutes until it’s available again and working in the web interface, then right clicking and ‘re-spidering’ the URLs. This will re-request the PSI data for those URLs selected and continue crawling and requesting API data for other URLs.
  • “500: Lighthouse returned error: ERRORED_DOCUMENT_REQUEST. Lighthouse was unable to reliably load the page you requested.” – This error is again typically replicable in the web interface and is not an issue with the SEO Spider, or the API, it is directly related to the Lighthouse audit conducted by PSI. Unfortunately ‘re-spidering’ these URLs to re-request API data generally does not help. You can provide Google with feedback about any errors you experience directly on their mailing list or ask questions via Stack Overflow.

Please read our FAQ on PageSpeed Insights API Errors for more information.


Majestic

Configuration > API Access > Majestic

In order to use Majestic, you will need a subscription which allows you to pull data from their API. You then just need to navigate to ‘Configuration > API Access > Majestic’ and then click on the ‘generate an Open Apps access token’ link.

Majestic API

You will then be taken to Majestic, where you need to ‘grant’ access to the Screaming Frog SEO Spider.

Majestic API grant access

You will then be given a unique access token from Majestic.

Majestic API authorised

Copy and input this token into the API key box in the Majestic window, and click ‘connect’ –

Majestic API connected

You can then select the data source (fresh or historic) and metrics, at either URL, subdomain or domain level.

Majestic API metrics

Then simply click ‘start’ to perform your crawl, and the data will be automatically pulled via their API, and can be viewed under the ‘link metrics’ and ‘internal’ tabs.

link metrics integration

Ahrefs

Configuration > API Access > Ahrefs

In order to use Ahrefs, you will need a subscription which allows you to pull data from their API. You then just need to navigate to ‘Configuration > API Access > Ahrefs’ and then click on the ‘generate an API access token’ link.

ahrefs API integration

You will then be taken to Ahrefs, where you need to select your workspace.

Ahrefs select workspace

Then ‘allow’ access to the Screaming Frog SEO Spider.

Ahrefs allow access to Screaming Frog

You will then be given a unique access token from Ahrefs (but hosted on the Screaming Frog domain).

ahrefs API token

If a ‘We Missed Your Token’ message is displayed, then follow the instructions in our FAQ here. Then copy and input this token into the API key box in the Ahrefs window, and click ‘connect’ –

Connect to Ahrefs API

You can then select the metrics you wish to pull at either URL, subdomain or domain level.

Ahrefs API metrics

Then simply click ‘start’ to perform your crawl, and the data will be automatically pulled via their API, and can be viewed under the ‘link metrics’ and ‘internal’ tabs.


Moz

Configuration > API Access > Moz

You will require a Moz account to pull data from the Mozscape API. Moz offer a free limited API and a separate paid API, which allows users to pull more metrics, at a faster rate. Please note, this is a separate subscription to a standard Moz PRO account. You can read about free vs paid access over at Moz.

To access the API with either a free account or a paid subscription, you just need to log in to your Moz account and view your API ID and secret key.

Moz API key

Copy and input both the access ID and secret key into the respective API key boxes in the Moz window under ‘Configuration > API Access > Moz’, select your account type (‘free’ or ‘paid’), and then click ‘connect’ –

moz API integration

You can then select the metrics available to you, based upon your free or paid plan. Simply choose the metrics you wish to pull at either URL, subdomain or domain level.

moz API metrics

Then simply click ‘start’ to perform your crawl, and the data will be automatically pulled via their API, and can be viewed under the ‘link metrics’ and ‘internal’ tabs.


Authentication

Configuration > Authentication

The SEO Spider supports two forms of authentication: standards based, which includes basic and digest authentication, and web forms based authentication.

Check out our video guide on how to crawl behind a login, or carry on reading below.

Basic & Digest Authentication

There is no set-up required for basic and digest authentication, it is detected automatically during a crawl of a page which requires a login. If you visit the website and your browser gives you a pop-up requesting a username and password, that will be basic or digest authentication. If the login screen is contained in the page itself, this will be a web form authentication, which is discussed in the next section.

Sites in development are often also blocked via robots.txt, so make sure this is not the case or use the ‘ignore robots.txt’ configuration. Then simply insert the staging site URL, crawl, and a pop-up box will appear, just like it does in a web browser, asking for a username and password.

basic authentication

Enter your credentials and the crawl will continue as normal.

Alternatively, you can pre-enter login credentials via ‘Config > Authentication’ and clicking ‘Add’ on the Standards Based tab.

Add Standards Based Authentication Details

Then input the URL, username and password.

Standards Based Authentication

When entered in the authentication config, they will be remembered until they are deleted.

This feature does not require a licence key. Try the following pages to see how authentication works in your browser, or in the SEO Spider.
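For context, this is roughly what standards based authentication looks like at the HTTP level; a minimal sketch using Python’s requests library, where the staging URL and credentials are placeholders:

import requests
from requests.auth import HTTPBasicAuth, HTTPDigestAuth

URL = "https://staging.example.com/"  # placeholder protected page

# An unauthenticated request to a protected page returns a 401 challenge,
# which is what triggers the SEO Spider's username and password pop-up.
print(requests.get(URL, timeout=30).status_code)  # 401

# Supplying credentials satisfies basic authentication...
print(requests.get(URL, auth=HTTPBasicAuth("user", "pass"), timeout=30).status_code)

# ...or digest authentication, depending on the challenge the server sends.
print(requests.get(URL, auth=HTTPDigestAuth("user", "pass"), timeout=30).status_code)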

Web Form Authentication

There are other web forms and areas which require you to log in with cookies for authentication, in order to view or crawl them. The SEO Spider allows users to log in to these web forms within its built-in Chromium browser, and then crawl them. This feature requires a licence to use.

To log in, navigate to ‘Configuration > Authentication’ then switch to the ‘Forms Based’ tab, click the ‘Add’ button, enter the URL for the site you want to crawl, and a browser will pop up allowing you to log in.

Web Forms Authentication

Please read our guide on crawling web form password protected sites before using this feature. Some websites may also require JavaScript rendering to be enabled when logged in, in order to crawl them.

Please note – This is a very powerful feature, and should therefore be used responsibly. The SEO Spider clicks every link on a page; when you’re logged in that may include links to log you out, create posts, install plugins, or even delete data.

Authentication Profiles

The authentication profiles tab allows you to export an authentication configuration to be used with scheduling, or command line.

This means it’s possible for the SEO Spider to login to standards and web forms based authentication for automated crawls.

Authentication Profiles

When you have authenticated via standards based or web forms authentication in the user interface, you can visit the ‘Profiles’ tab, and export an .seospiderauthconfig file.

This can be supplied in scheduling via the ‘start options’ tab, or using the ‘auth-config’ argument for the command line as outlined in the CLI options.

Authentication Profiles In Scheduling

Please note – We can’t guarantee that automated web forms authentication will always work, as some websites will expire login tokens or have 2FA etc. Exporting or saving a default authentication profile will store an encrypted version of your authentication credentials on disk using AES-256 Galois/Counter Mode.
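For those interested in what AES-256 Galois/Counter Mode involves, here is a minimal, purely illustrative sketch using Python’s cryptography package (this is not the SEO Spider’s actual implementation):

import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)   # 256-bit key for AES-256
nonce = os.urandom(12)                      # 96-bit nonce, the GCM standard size
aesgcm = AESGCM(key)

# GCM provides confidentiality plus an authentication tag, so tampering is detected on decrypt.
ciphertext = aesgcm.encrypt(nonce, b"username:password", None)
assert aesgcm.decrypt(nonce, ciphertext, None) == b"username:password"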

Troubleshooting

  • Forms based authentication uses the configured User Agent. If you are unable to log in, perhaps try setting this to Chrome or another browser.

Segments

Configuration > Segments

You can segment a crawl to better identify and monitor issues and opportunities from different templates, page types, or areas of priority.

Watch our video, or read our guide below on how to set up segments.

The segments right-hand tab and configuration is only available if you’re using database storage mode.

If you’re not already using database storage mode, we highly recommend it. This can be adjusted via ‘File > Settings > Storage Mode’ and has a number of benefits.

The segmentation config can be accessed via the config menu or right-hand ‘Segments’ tab, and it allows you to segment based upon any data found in the crawl, including data from APIs such as GA or GSC, or post-crawl analysis.

You can set up a segment at the start, during, or at the end of a crawl. Once set up, segments can be saved with the configuration.

Set Up Segments

A segments column will appear in each tab with coloured labels against each URL with their segment.

Segment columns and labels

When segments are set up, the right hand ‘Issues’ tab includes a segments bar, so you can quickly see where on the site the issues are at a glance.

Issues tab with segments

You can then use the right-hand segments filter, to drill down to individual segments.

Segments global filter

The right-hand ‘Segments’ tab is an aggregated view, to quickly see where issues are by segment.

right-hand Segments tab

You can use the Segments tab ‘view’ filter to better analyse items like crawl depth by segment, or which segments have different types of issues.

Segments are fully integrated into various other features in the SEO Spider as well.

  • You can select to colour crawl visualisations by segments.
  • You can choose to create XML Sitemaps by segment, and the SEO Spider will automatically create a Sitemap Index file referencing each segmented sitemap.
  • Within the Export for Looker Studio for automated crawl reports, a separate sheet will also be automatically created for each segment when a saved configuration is supplied with segments set-up. This means you can monitor issues by segment in a Looker Studio Crawl Report as well.

Crawl analysis

Configuration > Crawl Analysis

The SEO Spider usually analyses and reports data at run-time, where metrics, tabs and filters are populated during a crawl. However, ‘Link Score’ and a relatively small number of filters require calculation at the end of a crawl (or when a crawl has been stopped).

The full list of items that require ‘crawl analysis’ can be viewed below, and seen under ‘Config > Crawl Analysis’.

Crawl Analysis

All of the above are filters under their respective tabs, apart from ‘Link Score’, which is a metric and shown as a column in the ‘Internal’ tab.

In the right hand ‘overview’ window pane, filters which require post ‘crawl analysis’ are marked with ‘Crawl Analysis Required’ for further clarity. The ‘Sitemaps’ filters in particular, mostly require post-crawl analysis.

Right hand overview crawl analysis required

They are also marked as ‘You need to perform crawl analysis for this tab to populate this filter’ within the main window pane.

Crawl Analysis tabs message

This analysis can be automatically performed at the end of a crawl by ticking the respective ‘Auto Analyse At End of Crawl’ tickbox under ‘Configure’, or it can be run manually by the user.

To run the crawl analysis, simply click ‘Crawl Analysis > Start’ in the top level menu.

Start Crawl Analysis

When the crawl analysis is running you’ll see the ‘analysis’ progress bar with a percentage complete. The SEO Spider can continue to be used as normal during this period.

Crawl Analysis Running

When the crawl analysis has finished, the empty filters which are marked with ‘Crawl Analysis Required’, will be populated with lots of lovely insightful data.

Filter populated after crawl analysis

Please note – The Analytics and Search Console orphan URLs filters will only be populated if you have connected to their respective APIs and chosen to ‘Crawl New URLs Discovered in Google Analytics/Google Search Console’ under their ‘general’ tabs. Otherwise, orphan URLs will only be viewable under ‘Reports > Orphan Pages’.

For more information, watch our video guide on crawl analysis.


User Interface

File > Settings > User Interface (Windows, Linux)
Screaming Frog SEO Spider > Settings > User Interface (macOS)

There are a few configuration options under the user interface menu. These are as follows –

  • Theme > Light / Dark – By default the SEO Spider uses a light grey theme. However, you can switch to a dark theme (aka, ‘Dark Mode’, ‘Batman Mode’ etc). This theme can help reduce eye strain, particularly for those that work in low light.
  • Accent Colour – The SEO Spider uses green as its default colour for highlighting rows, cells and other UI options. However, you can adjust this to your own preference.
Dark Mode

Language

File > Settings > Language (Windows, Linux)
Screaming Frog SEO Spider > Settings > Language (macOS)

The GUI is available in English, Spanish, German, French and Italian. It will detect the language used on your machine on startup, and default to using it.

Language can also be set within the tool via ‘File > Settings > Language’.

Language Config

We may support more languages in the future, and if there’s a language you’d like us to support, please let us know via support.


Proxy

File > Settings > Proxy (Windows, Linux)
Screaming Frog SEO Spider > Settings > Proxy (macOS)

The proxy feature allows you to configure the SEO Spider to use a proxy server.

You will need to configure the address and port of the proxy in the configuration window. To disable the proxy server untick the ‘Use Proxy Server’ option.

Please note:

  • Only 1 proxy server can be configured.
  • You must restart for your changes to take effect.
  • No exceptions can be added – either all HTTP/HTTPS traffic goes via the proxy, or none of it does.
  • Some proxies may require you to input login details before the crawl using forms based authentication.

Storage mode

File > Settings > Storage Mode (Windows, Linux)
Screaming Frog SEO Spider > Settings > Storage Mode (macOS)

The Screaming Frog SEO Spider uses a configurable hybrid engine, allowing users to choose to store crawl data in RAM, or in a database.

Database Storage Mode

By default the SEO Spider uses RAM, rather than your hard disk, to store and process data. This provides amazing benefits such as speed and flexibility, but it also has disadvantages, most notably when crawling at scale.

However, if you have an SSD the SEO Spider can also be configured to save crawl data to disk, by selecting ‘Database Storage’ mode (under ‘File > Settings > Storage Mode’), which enables it to crawl at truly unprecedented scale, while retaining the same, familiar real-time reporting and usability.

Fundamentally, both storage modes can still provide virtually the same crawling experience, allowing for real-time reporting, filtering and adjusting of the crawl. However, there are some key differences, and the ideal storage will depend on the crawl scenario and machine specifications.

Memory Storage

Memory storage mode allows for super fast and flexible crawling for virtually all set-ups. However, as machines have less RAM than hard disk space, it means the SEO Spider is generally better suited for crawling websites under 500k URLs in memory storage mode.

Users are able to crawl more than this with the right set-up, depending on how memory intensive the website being crawled is. As a very rough guide, a 64-bit machine with 8gb of RAM will generally allow you to crawl a couple of hundred thousand URLs.

As well as being a better option for smaller websites, memory storage mode is also recommended for machines without an SSD, or where there isn’t much disk space.

Database Storage

We recommend this as the default storage for users with an SSD, and for crawling at scale.

Database storage mode allows for more URLs to be crawled for a given memory setting, with close to RAM storage crawling speed for set-ups with a solid state drive (SSD).

The full benefits of database storage mode include:

  • Crawling at larger scale.
  • Opening large crawls is quicker.
  • If you lose power, accidentally clear, or close a crawl, it won’t be lost. Crawls are auto saved, and can be opened again via ‘File > Crawls’.
  • Crawl comparison and change detection features are only available in this mode.

The default crawl limit is 5 million URLs, but it isn’t a hard limit – the SEO Spider is capable of crawling significantly more (with the right set-up). As an example, a machine with a 500gb SSD and 16gb of RAM should allow you to crawl up to approximately 10 million URLs.

While not recommended, if you have a fast hard disk drive (HDD), rather than a solid state disk (SSD), then this mode can still allow you to crawl more URLs. However, writing and reading speed of a hard drive does become the bottleneck in crawling – so both crawl speed, and the interface itself will be significantly slower.

Using a network drive is not supported – this will be much too slow and the connection unreliable. Using a local folder that syncs remotely, such as Dropbox or OneDrive, is not supported, as these processes lock files. Vault drives are also not supported.

If you’re working on the machine while crawling, it can also impact machine performance, so the crawl speed might need to be reduced to cope with the load. SSDs are so fast, they generally don’t have this problem and this is why ‘database storage’ can be used as the default for both small and large crawls.

Check out our video guide on storage modes.

Troubleshooting

  • If you get a red X rather than a green tick next to Database Directory, hover over it to see the error message.
  • If the error message includes “OverlappingFileLockException” this means you are using an ExFAT/MS-DOS (FAT) file system, which is not supported on macOS due to JDK-8205404. You’ll need to choose a drive with a different format, or reformat your drive, to resolve this. You can use the Disk Utility application to view the current format and reformat the drive.

Memory allocation

File > Settings > Memory Allocation (Windows, Linux)
Screaming Frog SEO Spider > Settings > Memory Allocation (macOS)

The SEO Spider uses Java which requires memory to be allocated at start-up. By default the SEO Spider will allow 1gb for 32-bit, and 2gb for 64-bit machines.

Increasing memory allocation will enable the SEO Spider to crawl more URLs, particularly when in RAM storage mode, but also when storing to database.

We recommend setting the memory allocation to at least 2gb below your total physical machine memory so the OS and other applications can operate.

Memory Allocation

If you’d like to find out more about crawling large websites, memory allocation and the storage options available, please see our guide on crawling large websites.


Trusted Certificates

File > Settings > Trusted Certificates (Windows, Linux)
Screaming Frog SEO Spider > Settings > Trusted Certificates (macOS)

A Man In The Middle (MITM) proxy will resign TLS certificates. If a resigned certificate is not from a trusted Certificate Authority (CA), the TLS connection will be rejected.

Trusted Certificates

Companies employing this style of proxy will usually distribute an X.509 certificate to employees. This X.509 certificate can be used by the SEO Spider by adding it to a ‘Trusted Certificates Folder’.

The SEO Spider will only accept X.509 certificates with the following extensions: .crt, .pem, .cer and .der.

How To Add A Trusted Certificate

When a proxy is changing the issuer of a certificate, it can be quickly seen in a browser such as Chrome. Simply load the Screaming Frog website in Chrome and click the ‘view site information’ icon to the left of the address.

Trusted Certificates more info

Next, click on ‘Connection is secure’ (or ‘Connection is insecure’) and ‘Certificate is valid’ –

Trusted Certificates certificate valid

Then click on ‘Details’ and ‘Issuer’ in the ‘Certificate Fields’ section –

Trusted Certificates certificate viewer

The issuer for the Screaming Frog website shows as “GTS CA 1P5”. However, you may see something different, such as your proxy, for example ZScaler. This would show that the issuer of the certificate is being changed in your local environment.

If that is the case, click on the “GTS CA 1P5” issuer certificate and then ‘Export’ at the bottom.

Trusted Certificates export

Then ‘Add’ this certificate file to the SEO Spider Trusted Certificates trust store.

Trusted certificate in the trust store
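As an alternative to clicking through Chrome’s certificate viewer, the issuer can also be checked with a short script; a minimal sketch, assuming the Python cryptography package is installed:

import ssl

from cryptography import x509

# Fetch the certificate presented to this machine. If a MITM proxy is resigning
# certificates, the issuer printed here will be the proxy's CA rather than the
# site's real issuer.
pem = ssl.get_server_certificate(("www.screamingfrog.co.uk", 443))
cert = x509.load_pem_x509_certificate(pem.encode())
print(cert.issuer.rfc4514_string())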

Mode

Mode > Spider / List / SERP

Spider Mode

This is the default mode of the SEO Spider. In this mode the SEO Spider will crawl a web site, gathering links and classifying URLs into the various tabs and filters. Simply enter the URL of your choice and click ‘start’.

List Mode

In this mode you can check a predefined list of URLs. This list can come from a variety of sources – a simple copy and paste, or a .txt, .xls, .xlsx, .csv or .xml file. The files will be scanned for http:// or https:// prefixed URLs; all other text will be ignored. For example, you can directly upload an Adwords download and all URLs will be found automatically.
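As a rough illustration of that scanning behaviour (not the SEO Spider’s own code), extracting the http:// and https:// prefixed URLs from a pasted block of text could look like this:

import re

# A rough approximation: pull out anything prefixed with http:// or https://.
URL_PATTERN = re.compile(r'https?://[^\s,"]+')

text = """Campaign,Final URL,Clicks
Brand,https://www.example.com/,120
Generic,https://www.example.com/contact/,45
some other text without a link"""

print(URL_PATTERN.findall(text))
# ['https://www.example.com/', 'https://www.example.com/contact/']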

List Mode - Crawling a List of URLs

If you’re performing a site migration and wish to test URLs, we highly recommend using the ‘always follow redirects‘ configuration so the SEO Spider finds the final destination URL. The best way to view these is via the ‘redirect chains’ report, and we go into more detail within our ‘How To Audit Redirects‘ guide.

List mode changes the crawl depth setting to zero, which means only the uploaded URLs will be checked. If you want to check links from these URLs, adjust the crawl depth to 1 or more in the ‘Limits’ tab in ‘Configuration > Spider’. List mode also sets the spider to ignore robots.txt by default, as we assume that if a list is being uploaded, the intention is to crawl all the URLs in the list.

If you wish to export data in list mode in the same order it was uploaded, then use the ‘Export’ button which appears next to the ‘upload’ and ‘start’ buttons at the top of the user interface.

Export in same order as uploaded

The data in the export will be in the same order and include all of the exact URLs in the original upload, including duplicates or any fix-ups performed.

If you’d like to learn how to perform more advanced crawling in list mode, then read our how to use list mode guide.

SERP Mode

In this mode you can upload page titles and meta descriptions directly into the SEO Spider to calculate pixel widths (and character lengths!). There is no crawling involved in this mode, so they do not need to be live on a website.

This means you can export page titles and descriptions from the SEO Spider, make bulk edits in Excel (if that’s your preference, rather than in the tool itself) and then upload them back into the tool to understand how they may appear in Google’s SERPs.

Under ‘reports’, we have a new ‘SERP Summary’ report which is in the format required to re-upload page titles and descriptions. We simply require three headers for ‘URL’, ‘Title’ and ‘Description’.

For example –

serp-snippet-upload-format

You can upload in a .txt, .csv or Excel file.
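As a small illustration of that format (with hypothetical example data), a compatible CSV could be generated like this:

import csv

rows = [
    {"URL": "https://www.example.com/", "Title": "Example Homepage",
     "Description": "A hypothetical meta description for the homepage."},
    {"URL": "https://www.example.com/contact/", "Title": "Contact Us",
     "Description": "A hypothetical meta description for the contact page."},
]

with open("serp-summary.csv", "w", newline="", encoding="utf-8") as f:
    # The three headers required for the re-upload: 'URL', 'Title' and 'Description'.
    writer = csv.DictWriter(f, fieldnames=["URL", "Title", "Description"])
    writer.writeheader()
    writer.writerows(rows)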

Compare

This mode allows you to compare two crawls and see how data has changed in tabs and filters over time. Please see our tutorial on ‘How To Compare Crawls’ for a walk-through guide.

The compare feature is only available in database storage mode with a licence. If you haven’t already moved, it’s as simple as ‘File > Settings > Storage Mode’ and choosing ‘Database Storage’.

There are two options to compare crawls –

1) Switch to ‘compare’ mode via ‘Mode > Compare’ and click ‘Select Crawl’ via the top menu to pick two crawls you wish to compare.

Mode Compare

2) When in ‘Spider’ or ‘List’ modes go to ‘File > Crawls’, highlight two crawls, and ‘Select To Compare’, which will switch you to ‘compare’ mode.

Select to compare crawls

You can then adjust the compare configuration via the ‘cog’ icon, or clicking ‘Config > Compare’. This allows you to select additional elements to analyse for change detection.

Then click ‘Compare’ for the crawl comparison analysis to run and the right hand overview tab to populate and show current and previous crawl data with changes.

Crawl Comparison Overview tab

You’re able to click on the numbers in the columns to view which URLs have changed, and use the filter on the master window view to toggle between current and previous crawls, or added, new, removed or missing URLs.

There are four columns and filters that help segment URLs that move into tabs and filters.

Added – URLs in the previous crawl that have moved into the filter in the current crawl.

New – New URLs that are in the current crawl and filter, but were not in the previous crawl.

Removed – URLs in the filter for the previous crawl, but not in the filter for the current crawl.

Missing – URLs that were previously in the filter, but were not found in the current crawl.

Essentially ‘added’ and ‘removed’ are URLs that exist in both current and previous crawls, whereas ‘new’ and ‘missing’ are URLs that only exist in one of the crawls.
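In set terms, the four groupings for a single tab and filter can be thought of roughly as follows (a conceptual sketch with hypothetical URLs, not the SEO Spider’s implementation):

# Hypothetical URLs for one filter across two crawls.
previous_crawl = {"/a", "/b", "/c", "/d"}
current_crawl = {"/a", "/b", "/c", "/e"}
previous_filter = {"/a", "/b", "/d"}   # URLs in the filter in the previous crawl
current_filter = {"/b", "/c", "/e"}    # URLs in the filter in the current crawl

added = (current_filter - previous_filter) & previous_crawl    # {'/c'}: in both crawls, moved into the filter
removed = (previous_filter - current_filter) & current_crawl   # {'/a'}: in both crawls, moved out of the filter
new = current_filter - previous_crawl                          # {'/e'}: only exists in the current crawl
missing = previous_filter - current_crawl                      # {'/d'}: no longer found in the current crawl

print(added, removed, new, missing)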

When you have completed a crawl comparison, a small comparison file is automatically stored in ‘File > Crawls’, which allows you to open and view it without running the analysis again.

This file utilises the two crawls compared, so both are required to be stored to view the comparison. Deleting one or both of the crawls in the comparison will mean the comparison is no longer accessible.

Please refer to our tutorial on ‘How To Compare Crawls‘ for more.
