SEO Spider Configuration


Spider Basic Tab

Check Images – Untick this box if you do not want to crawl images. (Please note, we check the link but don’t crawl the content.) This prevents the spider from checking images linked to using the image tag (img src=”image.jpg”). Images linked to via any other means will still be checked, for example, using an anchor tag (a href=”image.jpg”).
 
Check CSS - Untick this box if you do not want to crawl CSS. 

Check JavaScript – Untick this box if you do not want to crawl JavaScript. 

Check SWF – Untick this box if you do not want to crawl Flash files. 

Check External Links – Untick this box if you do not want to crawl any external links. 

Check Links Outside Of Start Folder – Untick this box if you do not want to crawl links outside of a sub folder you start from. This option provides you the ability to crawl within a start sub folder, but still crawl links that those URLs link to which are outside of the start folder. 

Follow Internal or External ‘nofollow’ – By default the spider will not crawl internal or external links with the ‘nofollow’ attribute or external links from pages with the meta nofollow tag. If you would like the spider to crawl these, simply tick the relevant option. 

Crawl All Subdomains – By default the SEO spider will only crawl the subdomain you crawl from and treat all other subdomains encountered as external sites. To crawl all subdomains of a root domain, use this option.  

Crawl Outside Of Start Folder – By default the SEO spider will only crawl the subfolder (or sub directory) you crawl from forwards. However, if you wish to start a crawl from a specific sub folder, but crawl the entire website, use this option. 

Crawl Canonicals – By default the SEO spider will crawl canonicals (canonical link elements or HTTP header) and use the links contained within for discovery. If you do not wish to crawl canonicals, then please untick this box. Please note that canonicals will still be reported and referenced in the SEO Spider, but they will not be crawled for discovery. 

Ignore robots.txt – By default the spider will obey the robots.txt protocol. The spider will not be able to crawl a site if it’s disallowed via robots.txt. However, this option allows you to ignore this protocol, which is down to the responsibility of the user. This option means the SEO spider will not even download the robots.txt file, so ALL robots directives will be ignored. 
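
For context, here is a minimal Python sketch (not the SEO Spider’s own code) of the kind of robots.txt check a crawler performs by default, and which this option skips entirely – the example.com URLs are purely illustrative:

import urllib.robotparser

# Fetch and parse the site's robots.txt, as a crawler would do by default.
parser = urllib.robotparser.RobotFileParser("http://www.example.com/robots.txt")
parser.read()

# Check whether a given user agent is allowed to fetch a URL.
# With 'Ignore robots.txt' enabled, this check simply never happens.
print(parser.can_fetch("Screaming Frog SEO Spider", "http://www.example.com/private/page.html"))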

Spider Limits Tab

Limit Search Total – The free version of the software will crawl up to 500 URIs. If you have a licensed version of the tool this limit is removed, but you can include any number here for greater control over the number of pages you wish to crawl. 

Limit Search Depth – You can choose how deep the spider crawls a site (in terms of links away from your chosen start point). 

Limit Max URL Length To Crawl – Control the length of URLs that the SEO Spider will crawl.  

Limit Max Folder Depth – Control the number of folders (or sub directories) the SEO Spider will crawl.  

Limit Number of Query Strings – Control the number of query string parameters (?x=) the SEO Spider will crawl.  

Spider Advanced Tab

Allow Cookies – By default the SEO spider does not accept cookies, like a search bot. However, you can choose to accept cookies by ticking this box.  

Request Authentication – By default the SEO spider will show a login box when a requested URL requires authentication. This option can be switched off.  

Pause On High Memory Usage – The SEO spider will automatically pause when a crawl has reached the memory allocation and display a ‘high memory usage’ message. However, you can choose to turn this safeguard off completely.  

Always Follow Redirects – This feature allows the SEO Spider to follow redirects until the final redirect target URL in list mode, ignoring crawl depth. This is particularly useful for site migrations, where URLs may perform a number of 3XX redirects, before they reach their final destination. To view the chain of redirects, we recommend using the ‘redirect chains‘ report.  

Respect noindex – This option means URLs with ‘noindex’ will not be reported in the SEO Spider.  

Respect Canonical – This option means URLs which have been canonicalised to another URL will not be reported in the SEO Spider.  

Response Timeout – By default the SEO spider will wait 10 seconds to get any kind of HTTP response from a URL. You can increase the length of waiting time, which is useful for very slow websites.  

5XX Response Retries – This option provides the ability to automatically re-try 5XX responses. Often these responses can be temporary, so re-trying a URL may provide a 2XX response.  

Max Redirects To Follow – This option provides the ability to control the number of redirects the SEO Spider will follow.  

Spider Preferences Tab

Page Title & Meta Description Width – This option provides the ability to control the character and pixel width limits in the SEO Spider filters in the page title and meta description tabs. For example, changing the minimum pixel width default number of ‘200’ would change the ‘Below 200 Pixels’ filter in the ‘Page Titles’ tab. This allows you to set your own character and pixel width limits based upon your own preferences.  

Other – These options provide the ability to control the character length of URLs, h1, h2 and image alt text filters in their respective tabs. You can also control the max image size.  

URL Rewriting

The URL rewriting feature allows you to rewrite URLs on the fly. For the majority of cases, the ‘remove parameters’ and common options (under ‘options’) will suffice. However, we do also offer an advanced regex replace feature which provides further control.

Remove Parameters

This feature allows you to automatically remove parameters in URLs. This is extremely useful for websites with session IDs or lots of parameters which you wish to remove. For example –

If the website has session IDs which make the URLs appear something like ‘example.com/?sid=random-string-of-characters’, you just need to add ‘sid’ (without the apostrophes) within the ‘Parameters’ field in the ‘Remove Parameters’ tab to remove the session ID.

The SEO spider will then automatically strip the session ID from the URL. You can test to see how a URL will be rewritten by our SEO spider under the ‘test’ tab.
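
As an illustration of the rewriting behaviour, here is a minimal Python sketch of the equivalent logic (not the SEO Spider’s own code – the URLs are purely illustrative):

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def remove_parameters(url, params_to_remove):
    # Split the URL, drop any query parameter whose name is in the removal list,
    # then rebuild the URL without it.
    parts = urlsplit(url)
    kept = [(name, value) for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name not in params_to_remove]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))

print(remove_parameters("http://example.com/?sid=random-string-of-characters&page=2", {"sid"}))
# http://example.com/?page=2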

Options

We will include common options under this section. The ‘lowercase discovered URLs’ option does exactly that – it converts all URLs crawled into lowercase, which can be useful for websites with case sensitivity issues in URLs.

Regex Replace

This advanced feature runs against each URL found during a crawl. It replaces each substring of a URL that matches the regex with the given replace string. The “Regex Replace” feature can be tested in the “Test” tab of the “URL Rewriting” configuration window.

Examples are:

1) Changing all links to example.com to be example.co.uk

Regex: .com
Replace: .co.uk

2) Changing all links containing page=number to a fixed number, e.g.

www.example.com/page.php?page=1
www.example.com/page.php?page=2
www.example.com/page.php?page=3
www.example.com/page.php?page=4

To make all these go to www.example.com/page.php?page=1

Regex: page=\d+
Replace: page=1

3) Removing the www. domain from any URL by using an empty Replace. If you want to remove a query string parameter, please use the “Remove Parameters” feature – Regex is not the correct tool for this job!

Regex: www.
Replace:
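
These rules behave like a standard regex substitution applied to each discovered URL. Here is a minimal Python sketch of examples 2 and 3 (the URLs are illustrative, and note that escaping the dot, i.e. ‘www\.’, is stricter than the literal ‘www.’ shown above):

import re

urls = [
    "http://www.example.com/page.php?page=3",
    "http://www.example.com/page.php?page=4",
]

for url in urls:
    rewritten = re.sub(r"page=\d+", "page=1", url)  # example 2: normalise the page parameter
    rewritten = re.sub(r"www\.", "", rewritten)     # example 3: strip the www. prefix
    print(rewritten)

# Both URLs are rewritten to http://example.com/page.php?page=1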
 

Include

This feature allows you to control which URL path the SEO spider will crawl via regex. It narrows the default search by only crawling the URLs that match the regex, which is particularly useful for larger sites, or sites with less intuitive URL structures.

The page that you start the crawl from must have an outbound link which matches the regex for this feature to work. (Obviously if there is not a URL which matches the regex from the start page, the SEO spider will not crawl anything!).
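
Conceptually, the include works as a match test against each URL found during the crawl. A minimal Python sketch of that idea, using a hypothetical include regex for a /blog/ section:

import re

# Hypothetical include regex: only crawl URLs within the /blog/ section.
include_pattern = re.compile(r"https?://www\.example\.com/blog/.*")

def should_crawl(url):
    # Only URLs matching the include regex are crawled.
    return bool(include_pattern.match(url))

print(should_crawl("http://www.example.com/blog/post-1"))  # True
print(should_crawl("http://www.example.com/contact"))      # False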


Exclude

This allows you to list files and paths to exclude from crawling. This feature used to use robots.txt syntax, but switched to regex from version 1.80 for greater control. For example –

http://www.example.com/do-not-crawl-this-page.html

http://www.example.com/do-not-crawl-this-folder/.*
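
The exclude is effectively the mirror image of the include. A minimal Python sketch of the same idea, using the two example patterns above (with the dots escaped for strictness):

import re

exclude_patterns = [
    re.compile(r"http://www\.example\.com/do-not-crawl-this-page\.html"),
    re.compile(r"http://www\.example\.com/do-not-crawl-this-folder/.*"),
]

def is_excluded(url):
    # A URL is skipped if it matches any of the exclude regexes.
    return any(pattern.match(url) for pattern in exclude_patterns)

print(is_excluded("http://www.example.com/do-not-crawl-this-folder/page.html"))  # True
print(is_excluded("http://www.example.com/keep-this-page.html"))                 # False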

You can also view our video guide about the exclude feature in the SEO Spider.

Speed

This feature allows you to control the speed of the spider, either by number of concurrent threads or by URLs requested per second. Increasing the number of threads allows you to significantly increase the speed of the SEO spider, so please use responsibly.

When reducing speed, it’s always easier to control the crawl rate with the ‘Max URI/s’ option, which is the maximum number of URL requests per second. For example, setting ‘Max URI/s’ to 1 would mean crawling at a maximum of 1 URL per second.


The ‘Max Threads’ option can simply be left alone.
 

User Agent

The user-agent switcher has inbuilt preset user agents for Googlebot, Bingbot, Yahoo! Slurp, various browsers and more. This feature also has a custom user-agent setting which allows you to specify your own user agent.  
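
In HTTP terms, switching the user agent simply changes the User-Agent request header the SEO Spider sends. A minimal Python sketch of the same effect (the user-agent string and URL are illustrative only):

import urllib.request

# Illustrative custom user-agent string (here, a Googlebot-style string).
custom_user_agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

request = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": custom_user_agent},
)
response = urllib.request.urlopen(request)
print(response.getcode())  # HTTP status code returned for this user agent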

Custom

The spider allows you to find anything you want in the source code of a website. The custom regex search feature will check the source code of every page you decide to crawl for whatever you wish to find. There are ten filters in total under the ‘custom’ configuration menu which allow you to input your regex and find pages that either ‘contain’ or ‘do not contain’ your chosen input. You cannot ‘scrape’ or extract data from HTML elements using this feature at the moment.

The pages that either do or do not contain these can be found in the ‘Custom’ tab, under the filter number which matches the one in your configuration. For example, you may wish to choose ‘contains’ for a phrase like ‘Out of Stock’, as you wish to find any pages which include this text. When searching for something like Google Analytics code, it would make more sense to choose the ‘does not contain’ filter to find pages that do not include the code (rather than just list all those that do!).

In this example, any pages with ‘out of stock’ on them would appear in the ‘Custom’ tab under filter 1, while any pages on which the spider could not find the Analytics UA number would be listed under filter 2.

Please remember – the custom search checks the HTML source code of a website, which might not be the text that is rendered in your browser. Hence, please ensure you are searching for the text as it appears in the source code.
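
As a rough illustration of what the ‘contains’ and ‘does not contain’ checks do against the raw HTML source, here is a minimal Python sketch (not the SEO Spider’s own code – the patterns and URL are illustrative):

import re
import urllib.request

# Filter 1 (hypothetical): 'contains' a phrase.
contains_pattern = re.compile(r"out of stock", re.IGNORECASE)
# Filter 2 (hypothetical): 'does not contain' a Google Analytics UA number.
analytics_pattern = re.compile(r"UA-\d{4,10}-\d{1,4}")

html = urllib.request.urlopen("http://www.example.com/").read().decode("utf-8", "replace")

if contains_pattern.search(html):
    print("Filter 1: page contains 'out of stock'")
if not analytics_pattern.search(html):
    print("Filter 2: page does not contain an Analytics UA number")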
 

Proxy

This feature allows you to use a proxy with the SEO spider by specifying the address and port.
 

Mode

List Mode

The default ‘mode’ is spider. Simply enter the URL of your choice and click ‘start’ to crawl the website. Alternatively switch to ‘list’ mode to upload a list of URLs to the spider or crawl an .xml file. Simply click ‘select file’ and browse to your file which contains the list of URLs or the .xml file to upload. Please remember to choose the correct file type when you upload – a .txt, .csv, .xml or Unicode text (.csv) file.

The only requirement is that links are proper hyperlinks (except for .xml files) including the ‘http://’. If you upload a list of URLs without the ‘http://’, the spider will not find them. No other formatting is required; the spider will find any URLs regardless of other text contained in the file, which columns they are in, or how they are spaced. For example, you can directly upload an AdWords download and all URLs will be found automatically.
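
In other words, URL discovery in list mode essentially picks out anything that starts with ‘http://’ (or ‘https://’) from the uploaded file. A minimal Python sketch of that idea, using a hypothetical file name:

import re

# A loose pattern for full hyperlinks beginning with http:// or https://.
url_pattern = re.compile(r"https?://\S+")

# 'adwords-export.txt' is a hypothetical file containing URLs mixed with other text.
with open("adwords-export.txt", encoding="utf-8") as upload:
    urls = url_pattern.findall(upload.read())

print(urls)  # every full hyperlink found, regardless of surrounding columns or text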

If you’re performing a site migration and wish to test URLs, we highly recommend using the ‘always follow redirects‘ configuration so the SEO Spider finds the final destination URL. The best way to view these is via the ‘redirect chains’ report.

SERP Mode

You can switch to ‘SERP mode’ and upload page titles and meta descriptions directly into the SEO Spider to calculate pixel widths (and character lengths!). There is no crawling involved in this mode, so they do not need to be live on a website.

This means you can export page titles and descriptions from the SEO Spider, make bulk edits in Excel (if that’s your preference, rather than in the tool itself) and then upload them back into the tool to understand how they may appear in Google’s SERPs.

Under ‘reports’, we have a new ‘SERP Summary’ report which is in the format required to re-upload page titles and descriptions. We simply require three headers for ‘URL’, ‘Title’ and ‘Description’.

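For example, a minimal SERP Summary file might look like this (the URLs, titles and descriptions below are purely illustrative):

URL,Title,Description
http://www.example.com/,Example Homepage Title,A short meta description used to illustrate the upload format.
http://www.example.com/contact,Contact Example Ltd,Another illustrative description row for the contact page.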

You can upload in a .txt, .csv or Excel file.
