These settings can be accessed from the menu under Configuration->Spider.
Check Images – Untick this box if you do not want to crawl images. (Please note, we check the link but don’t crawl the content.) This prevents the spider from checking images linked to using the image tag (img src="image.jpg"). Images linked to via any other means will still be checked, for example, using an anchor tag (a href="image.jpg").
Check CSS – Untick this box if you do not want to crawl CSS.
Check Links Outside Of Start Folder – Untick this box if you do not want to crawl links outside of a sub folder you start from. This option provides you the ability to crawl within a start sub folder, but still crawl links that those URLs link to which are outside of the start folder.
Follow Internal or External ‘nofollow’ – By default the spider will not crawl internal or external links with the ‘nofollow’ attribute or external links from pages with the meta nofollow tag. If you would like the spider to crawl these, simply tick the relevant option.
Crawl All Subdomains – By default the SEO spider will only crawl the subdomain you crawl from and treat all other subdomains encountered as external sites. To crawl all subdomains of a root domain, use this option.
Crawl Outside Of Start Folder – By default the SEO spider will only crawl the subfolder (or sub directory) you crawl from forwards. However, if you wish to start a crawl from a specific sub folder, but crawl the entire website, use this option.
Crawl Canonicals – By default the SEO spider will crawl canonicals (canonical link elements or http header) and use the links contained within for discovery. If you do not wish to crawl canonicals, then please untick this box. Please note, that canonicals will still be reported and referenced in the SEO Spider, but they will not be crawled for discovery.
Ignore robots.txt – By default the spider will obey the robots.txt protocol and will not crawl a site if it’s disallowed via robots.txt. This option allows you to ignore the protocol, which is the responsibility of the user. With this option enabled, the SEO spider will not even download the robots.txt file, so ALL robots directives will be ignored.
Limit Search Total – The free version of the software will crawl a maximum of 500 URIs. If you have a licensed version of the tool this limit will be removed, but you can enter any number here for greater control over the number of pages you wish to crawl.
Pause On High Memory Usage – The SEO spider will automatically pause when a crawl has reached the memory allocation and display a ‘high memory usage’ message. However, you can choose to turn this safeguard off completely.
Always Follow Redirects – This feature allows the SEO Spider to follow redirects until the final redirect target URL in list mode, ignoring crawl depth. This is particularly useful for site migrations, where URLs may perform a number of 3XX redirects before they reach their final destination. To view the chain of redirects, we recommend using the ‘redirect chains’ report.
Page Title & Meta Description Width – This option provides the ability to control the character and pixel width limits in the SEO Spider filters in the page title and meta description tabs. For example, changing the minimum pixel width default number of ‘200’, would change the ‘Below 200 Pixels’ filter in the ‘Page Titles’ tab. This allows you to set your own character and pixel width based upon your own preferences.
The URL rewriting feature allows you to rewrite URLs on the fly. For the majority of cases, the ‘remove parameters’ and common options (under ‘options’) will suffice. However, we do also offer an advanced regex replace feature which provides further control.
This feature allows you to automatically remove parameters in URLs. This is extremely useful for websites with session IDs or lots of parameters which you wish to remove. For example –
If the website has session IDs, the URLs may appear something like this: ‘example.com/?sid=random-string-of-characters’. To remove the session ID, you just need to add ‘sid’ (without the apostrophes) within the ‘parameters’ field in the ‘remove parameters’ tab.
The SEO spider will then automatically strip the session ID from the URL. You can test to see how a URL will be rewritten by our SEO spider under the ‘test’ tab.
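As a rough illustration of what stripping a parameter does to a URL, here is a minimal Python sketch. It is not the SEO Spider's own code; the function name `remove_parameters` and the sample URLs are assumptions made for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def remove_parameters(url, parameters):
    """Return the URL with the named query parameters stripped (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep every query pair whose name is not in the removal list.
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in parameters
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(remove_parameters("http://example.com/?sid=abc123", {"sid"}))
print(remove_parameters("http://example.com/page.php?page=2&sid=abc123", {"sid"}))
```

Other parameters are left in place, so only the session ID disappears from the rewritten URL.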
We will include common options under this section. The ‘lowercase discovered URLs’ option does exactly that: it converts all URLs crawled into lowercase, which can be useful for websites with case sensitivity issues in URLs.
This advanced feature runs against each URL found during a crawl. It replaces each substring of a URL that matches the regex with the given replace string. The “Regex Replace” feature can be tested in the “Test” tab of the “URL Rewriting” configuration window.
1) Changing all links to example.com to be example.co.uk
2) Making all links containing page=number to a fixed number, eg
To make all these go to www.example.com/page.php?page=1
3) Removing the www. domain from any url by using an empty Replace. If you want to remove a query string parameter, please use the “Remove Parameters” feature – Regex is not the correct tool for this job!
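The three cases above can be sketched with Python's `re.sub`, which also replaces every matching substring. The patterns and sample URLs below are illustrative assumptions, not the guide's own values, and the SEO Spider's regex syntax may differ in detail from Python's.

```python
import re


def regex_replace(url, pattern, replacement):
    """Replace each substring of the URL that matches the pattern (hypothetical helper)."""
    return re.sub(pattern, replacement, url)


# 1) Change links to example.com so they point at example.co.uk.
print(regex_replace("http://www.example.com/about", r"example\.com", "example.co.uk"))

# 2) Force any page=number to a fixed number.
print(regex_replace("http://www.example.com/page.php?page=4", r"page=\d+", "page=1"))

# 3) Remove the www. prefix by using an empty replacement.
print(regex_replace("http://www.example.com/", r"www\.", ""))
```

Escaping the dots (`\.`) matters here: an unescaped `.` matches any character and could rewrite more of the URL than intended.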
This feature allows you to control which URL path the SEO spider will crawl via regex. It narrows the default search by only crawling the URLs that match the regex which is particularly useful for larger sites, or sites with less intuitive URL structures. Matching is performed on the url encoded version of the URL.
The page that you start the crawl from must have an outbound link which matches the regex for this feature to work. (Obviously if there is not a URL which matches the regex from the start page, the SEO spider will not crawl anything!).
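To see why the start page needs a matching outbound link, consider this small Python sketch. It assumes a pattern has to match the whole URL (hence the `.*` wrappers), which is an assumption about matching semantics rather than something stated above. The function name and URLs are hypothetical.

```python
import re


def filter_included(discovered_urls, include_patterns):
    """Keep only the discovered URLs that fully match at least one include regex."""
    compiled = [re.compile(pattern) for pattern in include_patterns]
    return [
        url
        for url in discovered_urls
        if any(regex.fullmatch(url) for regex in compiled)
    ]


start_page_links = [
    "http://example.com/blog/post-1",
    "http://example.com/contact",
]

# One link matches, so the crawl can continue into the blog.
print(filter_included(start_page_links, [r".*/blog/.*"]))

# Nothing matches, so there would be nothing left to crawl.
print(filter_included(start_page_links, [r".*/shop/.*"]))
```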
This allows you to exclude URLs from a crawl by supplying a list of regular expressions (regex). The exclude list is applied to new URLs that are discovered during the crawl. It is not applied to the initial URL(s) supplied in crawl or list mode. Changing the exclude list during a crawl will only affect newly discovered URLs from then on; it will not be applied retrospectively to the list of pending URLs. Matching is performed on the URL-encoded version of the URL.
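The behaviour described above (excludes apply at discovery time, never to the start URL) can be modelled with a toy breadth-first crawl over an in-memory link graph. This is only a sketch under assumptions: the `crawl` function, the link graph and the full-match semantics are all hypothetical, and URL encoding is not modelled.

```python
import re
from collections import deque


def crawl(start_url, link_graph, exclude_patterns):
    """Walk a {url: [linked urls]} graph, dropping excluded URLs as they are discovered."""
    excluded = [re.compile(pattern) for pattern in exclude_patterns]
    seen = {start_url}  # the start URL is queued without an exclude check
    queue = deque([start_url])
    crawled = []
    while queue:
        url = queue.popleft()
        crawled.append(url)
        for link in link_graph.get(url, []):
            if link in seen:
                continue
            seen.add(link)
            # The exclude list is consulted only for newly discovered URLs.
            if any(regex.fullmatch(link) for regex in excluded):
                continue
            queue.append(link)
    return crawled


graph = {
    "http://example.com/": ["http://example.com/a", "http://example.com/search?q=1"],
    "http://example.com/a": ["http://example.com/search?q=2", "http://example.com/b"],
}
print(crawl("http://example.com/", graph, [r".*/search\?.*"]))
```

Starting the same crawl from `http://example.com/search?q=1` would still fetch that URL, because the exclude list is never applied to the initial URL.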
Here are some common examples –
This feature allows you to control the speed of the spider, either by number of concurrent threads or by URLs requested per second.
When reducing speed, it’s always easier to control by the ‘Max URI/s’ option, which is the maximum number of URL requests per second. For example, the screenshot below would mean crawling at 1 URL per second –
The ‘Max Threads’ option can simply be left alone.
Increasing the number of threads allows you to significantly increase the speed of the SEO spider.
However, please use responsibly as setting the number of threads high to increase the speed of the crawl will increase the number of http requests made to the server and can impact a site’s response times. We recommend approving a crawl rate with the webmaster first, monitoring response times and adjusting speed if there are any issues.
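For readers who want a feel for what a ‘Max URI/s’ style limit means in practice, here is a minimal Python sketch of a request throttle. It is not how the SEO Spider is implemented; the `RateLimiter` class and its parameters are assumptions for illustration.

```python
import time


class RateLimiter:
    """Space out calls so that no more than max_per_second happen each second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


limiter = RateLimiter(max_per_second=5)
for url in ["http://example.com/a", "http://example.com/b", "http://example.com/c"]:
    limiter.wait()  # call this immediately before each HTTP request
    print("requesting", url)
```

With a limit of 1 URL per second the minimum gap between requests is one second, regardless of how many threads are available.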
The user-agent switcher has inbuilt preset user agents for Googlebot, Bingbot, Yahoo! Slurp, various browsers and more. This feature also has a custom user-agent setting which allows you to specify your own user agent.
Details on how the SEO Spider handles robots.txt can be found here.
You can find the ‘Accept-Language’ configuration under ‘Configuration > HTTP Header > Accept-Language’.
This configuration allows you to include an Accept-Language HTTP header in the SEO Spider’s request. There are scenarios where you may wish to supply this header to crawl locale-adaptive content.
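As an illustration of what supplying this header amounts to at the HTTP level, the sketch below builds a request carrying an Accept-Language value using Python's standard library. The helper name, URL and language value are assumptions for the example, and no request is actually sent.

```python
from urllib.request import Request


def localized_request(url, accept_language):
    """Build (but do not send) a request that asks for content in a given locale."""
    return Request(url, headers={"Accept-Language": accept_language})


request = localized_request("http://example.com/", "de-DE,de;q=0.9")
# urllib stores header names capitalised as "Accept-language".
print(request.get_header("Accept-language"))
```

A locale-adaptive site receiving this request may respond with German content where a request without the header would receive the default language.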
The SEO Spider allows you to find anything you want in the source code of a website. The custom regex search feature will check the source code of every page you decide to crawl for what it is you wish to find. There are ten filters in total under the ‘custom’ configuration menu which allow you to input your regex and find pages that either ‘contain’ or ‘do not contain’ your chosen input. You cannot ‘scrape’ or extract data from HTML elements using this feature at the moment.
The pages that either do or do not contain these can be found in the ‘custom’ tab, using the filter number which matches the one in your configuration. For example, you may wish to choose ‘contains’ for a phrase like ‘Out of Stock’, as you wish to find any pages which have this on them. When searching for something like Google Analytics code, it would make more sense to choose the ‘does not contain’ filter to find pages that do not include the code (rather than just list all those that do!). For example –
In this example above, any pages with ‘out of stock’ on them would appear in the custom tab under filter 1. Any pages which the spider could not find the Analytics UA number on would be listed under filter 2.
Please remember – the custom search checks the html source code of a website which might not be the text that is rendered in your browser. Hence, please ensure you are searching for the correct query from the source code.
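The logic of ‘contains’ and ‘does not contain’ filters can be sketched in a few lines of Python. This is an illustrative model only: the `custom_search` function, the sample pages and the Analytics ID pattern are assumptions.

```python
import re


def custom_search(pages, filters):
    """Map each filter number to the URLs whose HTML source satisfies that filter.

    pages:   {url: html source}
    filters: list of (mode, regex) where mode is "contains" or "does not contain"
    """
    results = {}
    for number, (mode, pattern) in enumerate(filters, start=1):
        regex = re.compile(pattern)
        matched = []
        for url, html in pages.items():
            found = regex.search(html) is not None
            # "contains" wants a hit; "does not contain" wants no hit.
            if (mode == "contains") == found:
                matched.append(url)
        results[number] = matched
    return results


pages = {
    "/widget": "<p>Out of Stock</p><script>ga('create', 'UA-12345-1');</script>",
    "/gadget": "<p>In stock</p>",
}
filters = [("contains", r"Out of Stock"), ("does not contain", r"UA-\d+-\d+")]
print(custom_search(pages, filters))
```

Note that the search runs against the raw source string, so text that only appears after JavaScript runs in a browser would not be found.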
The custom extraction feature allows you to collect any data from the HTML of a URL. Extraction is performed on the static html returned by internal HTML pages with a 2xx response code. The SEO Spider supports the following modes to perform data extraction:
When using XPath or CSS Path to collect HTML, you can choose what to extract:
You’ll receive a tick next to your regex, Xpath or CSS Path if the syntax is valid. If you’ve made a mistake, a red cross will remain in place!
The results of the data extraction appear under the ‘custom’ tab and ‘extraction’ filter. They are also included as columns within the ‘Internal’ tab as well.
Some extraction examples include the following –
The data extracted is –
By default, the SEO Spider only collects h1s and h2s. If you would like to collect h3s as well, the XPath to collect the first couple of h3s in the code would be –
And the first couple of h3s on our site are as follows –
If you wanted to pull mobile annotations from a website, you might use an Xpath such as –
//link[contains(@media, '640') and @href]/@href
Which for the Huffington Post would return –
You will need to count how many hreflang annotations there are on a page first, before compiling this XPath. However, to collect the first couple, the XPath would be –
The above will collect the entire HTML element, with the link and hreflang value.
If you wanted just the hreflang values, you could specify the attribute using @hreflang.
Which would simply return the language value, like ‘en-GB’ for example.
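The difference between collecting the whole element and collecting only its attribute can be tried out with Python's standard library. The sketch below uses `xml.etree.ElementTree`, which supports only a subset of XPath (it can select elements by attribute but cannot return attribute nodes directly), so the attribute is read in a second step. The markup and function name are made up for the example, and the snippet must be well-formed XML.

```python
import xml.etree.ElementTree as ET

HEAD = """<head>
  <link rel="alternate" hreflang="en-GB" href="http://example.com/uk/"/>
  <link rel="alternate" hreflang="de-DE" href="http://example.com/de/"/>
  <link rel="stylesheet" href="/style.css"/>
</head>"""


def extract_hreflang(markup):
    """Return the hreflang value of every link element that has one."""
    root = ET.fromstring(markup)
    # Select the elements first, then read the attribute from each.
    return [link.get("hreflang") for link in root.findall(".//link[@hreflang]")]


print(extract_hreflang(HEAD))
```

The stylesheet link is skipped because it has no hreflang attribute.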
You may wish to extract social meta tags, such as Facebook Open Graph tags or Twitter Cards. The Xpath is for example –
You may wish to collect the types of various Schema on a page, so the set-up might be –
If you wanted to collect email addresses from your website or websites, the XPath might be something like –
From our website, this would return the two email addresses we have in the footer on every page –
This feature (Configuration->Proxy) allows you the option to configure the SEO spider to use a proxy server. You will need to configure the address and port of the proxy in the configuration window. To disable the proxy server untick the “Use Proxy Server” option.
You can connect to the Google Analytics API and pull in data directly during a crawl. The SEO Spider can fetch user and session metrics, as well as goal conversions and ecommerce (transactions and revenue) data for landing pages, so you can view your top performing pages when performing a technical or content audit.
If you’re running an Adwords campaign, you can also pull in impressions, clicks, cost and conversion data and the SEO Spider will match your destination URLs against the site crawl, too. You can also collect other metrics of interest, such as Adsense data (ad impressions, clicks, revenue etc.), site speed or social activity and interactions.
To set this up, start the SEO Spider and go to ‘Configuration > API Access > Google Analytics’.
Then you just need to connect to a Google account (which has access to the Analytics account you wish to query) by granting the ‘Screaming Frog SEO Spider’ app permission to access your account to retrieve the data. Google APIs use the OAuth 2.0 protocol for authentication and authorisation. The SEO Spider will remember any Google accounts you authorise within the list, so you can ‘connect’ quickly upon starting the application each time.
Once you have connected, you can choose the relevant Google Analytics account, property, view, segment and date range!
Then simply select the metrics that you wish to fetch! The SEO Spider currently allows you to select up to 20, which we might extend further. If you keep the number of metrics to 10 or below with a single dimension (as a rough guide), then it will generally be a single API query per 10k URLs, which makes it super quick –
As default the SEO Spider collects the following 10 metrics –
You can read more about the definition of each metric from Google.
There are scenarios where URLs in Google Analytics might not match URLs in a crawl, so we cover these by matching trailing and non-trailing slash URLs and case sensitivity (upper and lowercase characters in URLs). Google doesn’t pass the protocol (HTTP or HTTPS) via their API, so we also match this data automatically.
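One way to picture this kind of matching is as a normalisation step applied to both sides before comparing. The Python sketch below is an assumption about the general idea, not the SEO Spider's actual matching code: it drops the protocol, ignores a trailing slash and lowercases the URL (lowercasing everything, including the query string, is a simplification).

```python
from urllib.parse import urlsplit


def match_key(url):
    """Reduce a URL to a key that ignores protocol, trailing slash and case."""
    # Inputs without a protocol (as returned by analytics) still need to parse.
    parts = urlsplit(url if "//" in url else "//" + url)
    path = parts.path.rstrip("/") or "/"
    key = parts.netloc + path
    if parts.query:
        key += "?" + parts.query
    return key.lower()


def attach_ga_data(crawled_urls, ga_rows):
    """Look up analytics metrics for each crawled URL using the normalised key."""
    lookup = {match_key(url): metrics for url, metrics in ga_rows.items()}
    return {url: lookup.get(match_key(url)) for url in crawled_urls}


crawled = ["https://example.com/Blog/", "https://example.com/contact"]
ga_rows = {"example.com/blog": {"sessions": 120}}
print(attach_ga_data(crawled, ga_rows))
```

URLs with no counterpart on the analytics side come back as `None`, which corresponds to pages that would show no GA data.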
If you have hundreds of thousands of URLs in GA, you can choose to limit the number of URLs to query, which is by default ordered by sessions to return the top performing page data.
When you hit ‘start’ to crawl, the Google Analytics data will then be fetched and displayed in the respective columns within the ‘Internal’ and ‘Analytics’ tabs. There’s a separate ‘Analytics’ progress bar in the top right and when this has reached 100%, crawl data will start appearing against URLs. The more URLs you query, the longer this process can take, but generally it’s extremely quick.
There are 3 filters currently under the ‘Analytics’ tab, which allow you to filter the Google Analytics data –
As an example for our own website, we can see there is ‘no GA data’ for blog category pages and a few old blog posts, as you might expect (the query was landing page, rather than page). Remember, you may see pages appear here which are ‘noindex’ or ‘canonicalised’, unless you have ‘respect noindex’ and ‘respect canonicals’ ticked in the advanced configuration tab.
If GA data does not get pulled into the SEO Spider as you expected, then analyse the URLs in GA under ‘Behaviour > Site Content > All Pages’ and ‘Behaviour > Site Content > Landing Pages’, depending on which dimension you choose in your query. The URLs here need to match those in the crawl; if they don’t, the SEO Spider won’t be able to match up the data accurately.
We recommend checking your default Google Analytics view settings (such as ‘default page’) and filters which all impact how URLs are displayed and hence matched against a crawl. If you want URLs to match up, you can often make the required amends within Google Analytics.
This is the default mode of the SEO Spider. In this mode the SEO Spider will crawl a web site, gathering links and classifying URLs into the various tabs and filters. Simply enter the URL of your choice and click ‘start’.
In this mode you can check a predefined list of URLs. This list can come from a variety of sources – .txt, .xls, .xlsx, .csv or .xml files. The files will be scanned for URLs prefixed with http:// or https://, and all other text will be ignored. For example, you can directly upload an Adwords download and all URLs will be found automatically.
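Scanning a file for prefixed URLs and ignoring everything else can be approximated with a single regular expression. The Python sketch below is illustrative only; the pattern and the `extract_urls` name are assumptions, and treating commas as delimiters (convenient for CSV input) would truncate URLs that legitimately contain commas.

```python
import re

# Match http:// or https:// followed by anything up to whitespace, quotes,
# angle brackets or a comma.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>,]+")


def extract_urls(text):
    """Return the prefixed URLs found in the text, in order, without duplicates."""
    return list(dict.fromkeys(URL_PATTERN.findall(text)))


upload = (
    "Campaign,Destination URL\n"
    "Brand,http://example.com/a\n"
    "Generic,https://example.com/b?x=1\n"
    "notes www.example.com/c\n"
)
print(extract_urls(upload))
```

The `www.example.com/c` entry is skipped because it has no http:// or https:// prefix.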
If you’re performing a site migration and wish to test URLs, we highly recommend using the ‘always follow redirects’ configuration so the SEO Spider finds the final destination URL. The best way to view these is via the ‘redirect chains’ report.
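To make the idea of a redirect chain concrete, the following Python sketch follows a mapping of old to new URLs until it reaches a URL that no longer redirects. Everything in it is hypothetical: the `follow_redirects` function, the `max_hops` safeguard and the URLs are invented for illustration, and no HTTP requests are made.

```python
def follow_redirects(url, redirects, max_hops=20):
    """Return the chain of URLs visited, starting at url and ending at the final target."""
    chain = [url]
    while url in redirects and len(chain) <= max_hops:
        url = redirects[url]
        if url in chain:
            # A loop: record the repeated URL once and stop.
            chain.append(url)
            break
        chain.append(url)
    return chain


redirects = {
    "http://old.example.com/a": "http://old.example.com/b",
    "http://old.example.com/b": "https://new.example.com/b",
}
print(follow_redirects("http://old.example.com/a", redirects))
```

The last entry in the returned list is the final destination, which is the URL you would want to verify after a migration.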
List mode changes the crawl depth setting to zero, which means only the uploaded URLs will be checked. If you want to check links from these URLs, adjust the crawl depth to 1 or more in the “Limits” tab in Configuration->Spider.
In this mode you can upload page titles and meta descriptions directly into the SEO Spider to calculate pixel widths (and character lengths!). There is no crawling involved in this mode, so they do not need to be live on a website.
This means you can export page titles and descriptions from the SEO Spider, make bulk edits in Excel (if that’s your preference, rather than in the tool itself) and then upload them back into the tool to understand how they may appear in Google’s SERPs.
Under ‘reports’, we have a new ‘SERP Summary’ report which is in the format required to re-upload page titles and descriptions. We simply require three headers for ‘URL’, ‘Title’ and ‘Description’.
For example –
You can upload in a .txt, .csv or Excel file.
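As a quick sanity check before uploading, a file in this three-header format can be read and measured with Python's `csv` module. The sketch below only counts characters; pixel widths depend on font metrics and are not modelled. The `read_serp_upload` function and the sample row are assumptions for the example.

```python
import csv
import io

REQUIRED_HEADERS = {"URL", "Title", "Description"}


def read_serp_upload(text):
    """Parse CSV text with URL, Title and Description headers and measure lengths."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_HEADERS - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing headers: " + ", ".join(sorted(missing)))
    return [
        {
            "URL": row["URL"],
            "title_chars": len(row["Title"]),
            "description_chars": len(row["Description"]),
        }
        for row in reader
    ]


sample = "URL,Title,Description\nhttp://example.com/,Home,Welcome to Example\n"
print(read_serp_upload(sample))
```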