Posted 21 July, 2016 by in Screaming Frog SEO Spider

Screaming Frog SEO Spider Update – Version 6.0

I’m excited to announce version 6.0 of the Screaming Frog SEO Spider, codenamed internally as ‘render-Rooney’.

Our team have been busy in development and have some very exciting new features ready to release in the latest update. This includes the following –

1) Rendered Crawling (JavaScript)

There were two things we set out to do at the start of the year. Firstly, understand exactly what the search engines are able to crawl and index. This is why we created the Screaming Frog Log File Analyser, as a crawler will only ever be a simulation of search bot behaviour.

Secondly, we wanted to crawl rendered pages and read the DOM. It’s been known for a long time that Googlebot acts more like a modern day browser, rendering content, crawling and indexing JavaScript and dynamically generated content rather well. The SEO Spider is now able to render and crawl web pages in a similar way.

You can choose whether to crawl the static HTML, obey the old AJAX crawling scheme or fully render web pages, meaning JavaScript and dynamically generated content are executed and crawled.

[Image: rendering configuration]
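
For context on the middle option, the now deprecated AJAX crawling scheme worked by mapping ‘#!’ URLs to an ‘_escaped_fragment_’ query parameter that a server could answer with a pre-rendered HTML snapshot. A minimal sketch of that mapping (an illustration of the scheme only, not the SEO Spider’s code):

    # Illustration of the (now deprecated) AJAX crawling scheme: a '#!' URL is
    # mapped to the '_escaped_fragment_' URL a crawler would request instead.
    # The full scheme also percent-encodes special characters in the fragment;
    # this sketch skips that detail.
    def escaped_fragment_url(url: str) -> str:
        if "#!" not in url:
            return url
        base, fragment = url.split("#!", 1)
        separator = "&" if "?" in base else "?"
        return f"{base}{separator}_escaped_fragment_={fragment}"

    print(escaped_fragment_url("https://example.com/app#!/products/42"))
    # -> https://example.com/app?_escaped_fragment_=/products/42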

Google deprecated their old AJAX crawling scheme and we have seen JavaScript frameworks such as AngularJS (with links or utilising the HTML5 History API) crawled, indexed and ranking like a typical static HTML site. I highly recommend reading Adam Audette’s Googlebot JavaScript testing from last year if you’re not already familiar.

After much research and testing, we integrated the Chromium project library for our rendering engine to emulate Google as closely as possible. Some of you may remember the excellent ‘Googlebot is Chrome’ post from Joshua G on Mike King’s blog back in 2011, which discusses Googlebot essentially being a headless browser.

The new rendering mode is really powerful, but there are a few things to remember –

  • Typically crawling is slower even though it’s still multi-threaded, as the SEO Spider has to wait longer for the content to load and gather all the resources to be able to render a page. Our internal testing suggests Google wait approximately 5 seconds for a page to render, so this is the default AJAX timeout in the SEO Spider. Google may adjust this based upon server response and other signals, so you can configure this to your own requirements if a site is slower to load a page.
  • The crawling experience is quite different, as it can take time for anything to appear in the UI to start with, and then lots of URLs can appear at once. This is due to the SEO Spider waiting for all the resources to be fetched to render a page before the data is displayed.
  • To be able to render content properly, resources such as JavaScript and CSS should not be blocked from the SEO Spider. You can see URLs blocked by robots.txt (and the corresponding robots.txt disallow line) under ‘Response Codes > Blocked By Robots.txt’. You should also make sure that you crawl JS, CSS and external resources in the SEO Spider configuration.
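
If you want to quickly sanity check whether particular JS or CSS files are blocked before switching to rendering mode, a short sketch using Python’s standard library robots.txt parser can help. The site, resource URLs and user-agent string below are placeholders:

    # Rough check of whether rendering resources are blocked by robots.txt.
    # The site, resource URLs and user-agent string are placeholders.
    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser("https://example.com/robots.txt")
    parser.read()

    for resource in [
        "https://example.com/assets/app.js",
        "https://example.com/assets/styles.css",
    ]:
        allowed = parser.can_fetch("Screaming Frog SEO Spider", resource)
        print(resource, "allowed" if allowed else "blocked")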

It’s also important to note that, as the SEO Spider renders content like a browser from your machine, this can impact analytics and anything else that relies upon JavaScript.

By default the SEO Spider excludes execution of Google Analytics JavaScript tags within its engine; however, if a site is using other analytics solutions or JavaScript that shouldn’t be executed, remember to use the exclude feature.

2) Configurable Columns & Ordering

You’re now able to configure which columns are displayed in each tab of the SEO Spider (by clicking the ‘+’ in the top window pane).

[Image: configurable columns]

You can also drag and drop the columns into any order and this will be remembered (even after a restart).

To revert back to the default columns and ordering, simply right click on the ‘+’ symbol and click ‘Reset Columns’ or click on ‘Configuration > User Interface > Reset Columns For All Tables’.

3) XML Sitemap & Sitemap Index Crawling

The SEO Spider already allows crawling of XML sitemaps in list mode by uploading the .xml file (number 8 in the ‘10 features in the SEO Spider you should really know’ post). That was handy when a sitemap hadn’t been uploaded yet, but a little clunky when it was already live, as you had to save the file first.

So we’ve now introduced the ability to enter a sitemap URL to crawl it (‘List Mode > Download Sitemap’).

[Image: xml sitemap crawling]

Previously if a site had multiple sitemaps, you’d have to upload and crawl them separately as well.

Now, if you have a sitemap index file to manage multiple sitemaps, you can enter the sitemap index file URL and the SEO Spider will download all the sitemaps and the URLs within them!

[Image: crawl sitemap index]
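
For reference, a sitemap index is simply an XML file whose <sitemap><loc> entries point at the child sitemaps, which is what gets walked through here. A minimal sketch of reading one with Python’s standard library (the index URL is a placeholder):

    # Minimal sketch: list the child sitemaps referenced by a sitemap index.
    # The index URL is a placeholder; the namespace is the standard sitemaps.org one.
    from urllib.request import urlopen
    import xml.etree.ElementTree as ET

    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    with urlopen("https://example.com/sitemap_index.xml") as response:
        tree = ET.parse(response)

    for loc in tree.iter(f"{SITEMAP_NS}loc"):
        print(loc.text)  # each child sitemap listed in the index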

This should help save plenty of time!

4) Improved Custom Extraction – Multiple Values & Functions

We listened to feedback that users often wanted to extract multiple values without having to use multiple extractors. For example, previously to collect 10 values, you’d need to use 10 extractors, each with an index selector ([1], [2] etc.) in its XPath.

We’ve changed this behaviour, so by default a single extractor will now collect all values found and report them together, for XPath, CSS Path and Regex. If you have 20 hreflang values, you can use a single extractor to collect them all and the SEO Spider will dynamically add additional columns for however many are required. You’ll still have 9 extractors left to play with as well. So a single XPath such as –

[Image: multiple instances extraction]

Will now collect all values discovered.

[Image: multiple instances hreflang extraction]

You can still choose to extract just the first instance by using an index selector as well. For example, if you just wanted to collect the first h3 on a page, you could use the following XPath –

[Image: custom extraction single h3]

Functions can also be used anywhere in XPath, and a function can now be used on its own as well via the ‘function value’ dropdown. So if you wanted to count the number of links on a page, you might use the following XPath –

[Image: link count function value in xpath]
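
The exact expressions in the screenshots above aren’t reproduced here, but as a rough illustration of the three behaviours described (all matches, the first instance only, and a function value), this is how comparable XPath evaluates with Python’s lxml against a made-up HTML fragment:

    # Illustrative only: how XPath expressions of the kinds described above behave.
    # The HTML fragment and expressions are examples, not Screaming Frog defaults.
    from lxml import html

    page = html.fromstring("""
    <html><head>
      <link rel="alternate" hreflang="en" href="/en/" />
      <link rel="alternate" hreflang="de" href="/de/" />
    </head><body>
      <h3>First heading</h3><h3>Second heading</h3>
      <a href="/one">one</a><a href="/two">two</a>
    </body></html>""")

    print(page.xpath("//@hreflang"))       # all values -> ['en', 'de']
    print(page.xpath("(//h3)[1]/text()"))  # first instance only -> ['First heading']
    print(page.xpath("count(//a)"))        # function value -> 2.0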

I’d recommend reading our updated guide to web scraping for more information.

5) rel=“next” and rel=“prev” Elements Now Crawled

The SEO Spider can now crawl rel=“next” and rel=“prev” elements, whereas previously the tool merely reported them. If the configuration is enabled (‘Configuration > Spider > Basic Tab > Crawl Next/Prev’), any URL referenced in these elements that has not already been discovered will be added to the queue and crawled.
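
As a simple sketch of that discovery behaviour (the URLs and HTML below are made up, and this is not the SEO Spider’s internal code), pagination URLs found in rel=“next” and rel=“prev” link elements are only queued if they haven’t been seen before:

    # Made-up example of queueing undiscovered rel="next"/"prev" URLs.
    from lxml import html

    discovered = {"https://example.com/category?page=2"}  # already known URLs
    queue = []

    page = html.fromstring("""
    <html><head>
      <link rel="prev" href="https://example.com/category?page=1" />
      <link rel="next" href="https://example.com/category?page=3" />
    </head><body></body></html>""")

    for href in page.xpath("//link[@rel='next' or @rel='prev']/@href"):
        if href not in discovered:
            discovered.add(href)
            queue.append(href)

    print(queue)  # page=1 and page=3 are queued; page=2 was already known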

rel=“next” and rel=“prev” elements are not counted as ‘Inlinks’ (in the lower window tab) as they are not links in a traditional sense. Hence, if a URL does not have any ‘Inlinks’ in the crawl, it might well have been discovered via a rel=“next” or rel=“prev” element, or a canonical. We recommend using the ‘Crawl Path Report’ to show how the page was discovered, which will show the full path.

There’s also a new ‘respect next/prev’ configuration option (under ‘Configuration > Spider > Advanced tab’) which will hide any URLs with a ‘prev’ element, so they are not considered as duplicates of the first page in the series.

6) Updated SERP Snippet Emulator

Earlier this year, in May, Google increased the column width of the organic SERPs from 512px to 600px on desktop, which means titles and description snippets are longer. Google displays and truncates SERP snippets based on the pixel width of the characters rather than the number of characters, which can make it challenging to optimise.

Our previous research showed Google used to truncate page titles at around 482px on desktop. With the change, we have updated our research and logic in the SERP snippet emulator to match Google’s new truncation point before an ellipsis (…), which for page titles on desktop is around 570px.

[Image: updated serp snippet emulator]
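
As a rough sketch of what pixel-based truncation means in practice (not our emulator’s exact logic), the idea is to measure the rendered width of the title in the SERP font and cut it just before the pixel limit. The font file, 18px size and 570px limit below are illustrative assumptions, and Pillow 8+ is needed for getlength:

    # Rough sketch of pixel-width truncation. The font path, size and limit are
    # illustrative assumptions, not Screaming Frog's or Google's exact values.
    from PIL import ImageFont  # Pillow 8+ for ImageFont.getlength

    LIMIT_PX = 570
    font = ImageFont.truetype("arial.ttf", 18)  # assumes an Arial TTF is available

    def truncate_title(title: str, limit: int = LIMIT_PX) -> str:
        if font.getlength(title) <= limit:
            return title
        words = title.split()
        while words and font.getlength(" ".join(words) + " ...") > limit:
            words.pop()
        return " ".join(words) + " ..."

    print(truncate_title("An Extremely Long Example Page Title " * 4))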

Our research shows that while the space for descriptions has also increased, they are still being truncated far earlier, at a similar point to the older 512px-wide SERP. The SERP snippet emulator will only bold keywords within the snippet description, not in the title, in the same way as the Google SERPs.

Please note – You may occasionally see our SERP snippet emulator be a word out in either direction compared to what you see in the Google SERP. There will always be some pixel differences, which mean that the pixel boundary might not be in the exact same spot that Google calculate 100% of the time.

We are still seeing Google play to different rules at times as well, where some snippets have a longer pixel cut off point, particularly for descriptions! The SERP snippet emulator is therefore not always exact, but a good rule of thumb.

Other Updates

We have also included some other smaller updates and bug fixes in version 6.0 of the Screaming Frog SEO Spider, which include the following –

  • A new ‘Text Ratio’ column has been introduced in the internal tab which calculates the text to HTML ratio.
  • Google updated their Search Analytics API, so the SEO Spider can now retrieve more than 5k rows of data from Search Console.
  • There’s a new ‘search query filter’ for Search Console, which allows users to include or exclude keywords (under ‘Configuration > API Access > Google Search Console > Dimension tab’). This should be useful for excluding brand queries for example.
  • There’s a new configuration to extract images from the IMG srcset attribute under ‘Configuration > Advanced’.
  • The new Googlebot smartphone user-agent has been included.
  • Updated our support for relative base tags.
  • Removed the blank line at the start of Excel exports.
  • Fixed a bug with word count which could make it less accurate.
  • Fixed a bug with GSC CTR numbers.

I think that’s just about everything! As always, please do let us know if you have any problems or spot any bugs at all.

Thanks to everyone for all the support and continued feedback. Apologies for any features we couldn’t include in this update, we are already working on the next set of updates and there’s plenty more to come!

Now go and download version 6.0 of the SEO Spider!

Small Update – Version 6.1 Released 3rd August 2016

We have just released a small update to version 6.1 of the SEO Spider. This release includes –

  • Java 8 update 66 is now required on all platforms, as this update fixes several bugs in Java.
  • Relaxed certificate verification to be more tolerant when crawling HTTPS sites.
  • Fixed a crash when using the date range configuration for Google Analytics integration.
  • Fixed an issue with the lower window pane obscuring the main data window for some users.
  • Fixed a crash in custom extraction.
  • Fixed an issue in JavaScript rendering mode with the JS navigator.userAgent not being set correctly, causing sites performing UA profiling in JavaScript to misfire.
  • Fixed crash when starting a crawl without a selection in the overview window.
  • Fixed an issue with being too strict on parsing title tags. Google seem to use them regardless of whether they appear in a valid HTML head element.
  • Fixed a crash for Windows XP/Vista/Server 2003/Linux 32 bit users, which are not supported for rendering mode.

Update – Version 6.2 Released 16th August 2016

We have just released a small update to version 6.2 of the SEO Spider. This release includes –

  • Fix for several crashes.
  • Fix for the broken unavailable_after in the directives filter.
  • Fix for double clicking .seospider files on OS X not loading the crawl file.
  • Multiple extraction instances are now grouped together.
  • Export now respects column order and visibility preferences.