Screaming Frog SEO Spider Update – Version 4.0
I’m really pleased to announce version 4.0 of the Screaming Frog SEO Spider, codenamed internally as ‘Ella’.
We have been busy in development working on some significant improvements to the SEO Spider due for release later this year, which include a number of powerful new features we’ve wanted to release for a very longtime. Rather than wait, we decided to release some of these features now, with much more on the way.
Therefore, version 4.0 has two new big features in the release, and here’s the full details –
1) Google Analytics Integration
You can now connect to the Google Analytics API and pull in data directly during a crawl.
To get a better understanding of a website’s organic performance, it’s often useful to map on-page elements with user data and SEOs have for a long-time combined crawl data with Google Analytics in Excel, particularly for Panda and content audits. GA data is seamlessly fetched and matched to URLs in real time as you crawl, so you often see data start appearing immediately, which we hope makes the process more efficient.
The SEO Spider not only fetches user and session metrics, but it can also collect goal conversions and ecommerce (transactions and revenue) data for landing pages, so you can view your top performing pages when performing a technical or content audit.
If you’re running an Adwords campaign, you can also pull in impressions, clicks, cost and conversion data and we will match your destination URLs against the site crawl, too. You can also collect other metrics of interest, such as Adsense data (Ad impressions, clicks revenue etc), site speed or social activity and interactions.
To set this up, start the SEO Spider and go to ‘Configuration > API Access > Google Analytics’.
Then you just need to connect to a Google account (which has access to the Analytics account you wish to query) by granting the ‘Screaming Frog SEO Spider’ app permission to access your account to retreive the data. Google APIs use the OAuth 2.0 protocol for authentication and authorisation.
Once you have connected, you can choose the relevant Analytics account, property, view, segment and date range!
Then simply select the metrics that you wish to fetch. The SEO Spider currently allow you to select up to 20, which we might extend further. If you keep the number of metrics to 10 or below with a single dimension (as a rough guide), then it will generally be a single API query per 10k URLs, which makes it super quick –
There are circumstances where URLs in Google Analytics might not match URLs in a crawl, so we have a couple of common scenarios covered in our configuration, such as matching trailing and non-trailing slash URLs and case sensitivity (upper and lowercase characters in URLs).
If you have millions of URLs in GA, you can also choose to limit the number of URLs to query, which is by default ordered by sessions to return the top performing page data.
When you hit ‘start’ to crawl, the Google Analytics data will then be fetched and display in respective columns within the ‘Internal’ and ‘Analytics’ tabs. There’s a separate ‘Analytics’ progress bar in the top right and when this has reached 100%, crawl data will start appearing against URLs. Fetching the data from the API is independent of the crawl, and it doesn’t slow down crawl speed itself.
There are 3 filters currently under the ‘Analytics’ tab, which allow you to filter by ‘Sessions Above 0’, ‘Bounce Rate Above 70%’ and ‘No GA Data’. ‘No GA Data’ means that for the metrics and dimensions queried, the Google API didn’t return any data for the URLs in the crawl. So the URLs either didn’t receive any visits (sorry, ‘sessions’), or perhaps the URLs in the crawl are just different to those in GA for some reason.
For our site, we can see there is ‘no GA data’ for blog category pages and a few old blog posts, as you would expect really (the query was landing page, rather than page). Remember, you may see pages appear here which are ‘noindex’ or ‘canonicalised’, unless you have ‘respect noindex‘ and ‘respect canonicals‘ ticked in the advanced configuration tab.
Please note – If GA data does not get pulled into the SEO Spider as you expected, then analyse the URLs in GA under ‘Behaviour > Site Content > All Pages’ and ‘Behaviour > Site Content > Landing Pages’ depending on your query.
If they don’t match URLs in the crawl, then GA data won’t able to be matched up and appear in the SEO Spider. We recommend checking your default Google Analytics view settings (such as ‘default page’) and filters which all impact how URLs are displayed and hence matched against a crawl. If you want URLs to match up, you can often make the required amends within Google Analytics.
This is just our first iteration and we have some more advanced crawling, matching, canonicalisation and aggregation planned which will help in more complicated scenarios and provide further insights.
2) Custom Extraction
The new ‘custom extraction’ feature allows you to collect any data from the HTML of a URL. As many of you will know, our original intention was always to extend the existing ‘custom search’ feature, into ‘custom extraction’, which has been one of the most popular requests we have received.
If you’re familiar with scraping using import XML and Xpath, SeoTools for Excel, Scraper for Chrome or our friends at URL Profiler, then you’ll be at home using Xpath & CSS Path selectors and using the custom extraction feature.
You’ll find the new feature under ‘Configuration > Custom’.
‘Search’ is of course the usual custom source code search feature you should be familiar with –
‘Extraction’ is similar to the custom search feature, you have 10 fields which allow you to extract anything from the HTML of a web page by using either Xpath, CSS Path or failing those, regex. You can include the attribute as usual in XPath and for CSS Path an optional attribute field will appear after selection.
When using XPath or CSS Path to collect HTML, you can choose what to extract:
- Extract HTML Element: The selected element in full and the HTML content inside.
- Extract Inner HTML: The HTML content inside of the selected element. If the selected element contains other HTML elements, they will be included.
- Extract Text: The text content of the selected element, and the text content of any sub elements (essentially the HTML stripped entirely!).
You’ll get a lovely tick next to your regex, Xpath or CSS Path if it’s valid syntax. If you’ve made a mistake, a red cross will remain in place!
This will allow you to collect any data that we don’t currently as default or anything unique and specific to the work you’re performing. For example, Google Analytics IDs, schema, social meta tags (Open Graph Tags & Twitter cards), mobile annotations, hreflang values, as well as simple things like price of products, discount rates, stock availability etc. Let’s have a look at collecting some of these, with specific examples below.
Authors & Comments
Lets say I wanted to know the authors of every blog post on the site and the number of comments each have received. All I need to do is open up a blog post in Chrome, right click and ‘inspect element’ on the data I want to collect and right click again on the relevant HTML and copy the relevant CSS path or XPath. If you use Firebug in Firefox, then you can do the same there, too.
I can also name the ‘extractors’, which correspond to the column names in the SEO Spider. In this example, I’ve used a CSS Path and XPath for the fun of it.
The author names and number of comments extracted then shows up under the ‘extraction’ filter in the ‘custom’ tab, as well as the ‘internal’ tab allowing you to export everything collected all together into Excel.
Google Analytics ID
Traditionally the custom search feature has been really useful to ensure tracking tags are present on a page, but perhaps sometimes you may wish to pull the specific UA ID. Let’s use regex for this one, it would be –
And the data extracted is as you would expect –
The Xpath would just be –
Alongside the ‘Extract Text’ option. The first h3s returned are as follows –
If you wanted to pull mobile annotations from a website, you might use an Xpath such as –
//link[contains(@media, '640') and @href]/@href
Which for the Huffington Post would return –
We are working on a more robust report for hreflang as there can be so many types of errors and problems with the set-up, but in the meantime you can collect them using the new custom extraction feature as well. You might need to know how many hreflang there are on a page first –
The above will collect the entire HTML element, with the link and hreflang value.
So, perhaps you wanted just the hreflang values, you could specify the attribute using @hreflang –
This would just collect the language values –
Social Meta Tags
You may wish to extract social meta tags, such as Facebook Open Graph tags or Twitter Cards. Your set-up might be something like –
Which on Moz, would collect some lovely social meta tag data –
You may wish to collect the types of various Schema on a page, so the set-up might be –
And the extracted data is –
Perhaps you wanted to collect email addresses from your website or websites, the Xpath might be something like –
//a[starts-with(@href, 'mailto')] //a[starts-with(@href, 'mailto')]
From our website, this would return the two email addresses we have in the footer on every page –
That’s enough examples for now, and I am sure you will all have plenty of other smart ways this feature can be used. My Xpath is probably fairly poor, so feel free to improve! You can read our web scraping guide with more examples.
To help us avoid a sudden influx of queries about Xpath, CSS Path and regex syntax, please do read some of the really useful guides out there on each, before submitting a support query. If you have any guides you’d like to share in the comments, they are more than welcome too!
Some of you may notice there is an ‘address’ and ‘URL encoded Address’ field within the ‘URL Info’ tab in the lower window pane. This is because we have carried out a lot of internal research around URL encoding and how Google crawl and index them. While this feature isn’t as exiciting as those above, it’s a fairly significant improvement.
When loading in crawl files saved from previous releases the new ‘URL Encoded Address’ and ‘Address’ are not updated to the new behaviour implemented in version 4.0. If you want this information to be 100% accurate, you will need to re-crawl the site.
We have also performed other bug fixes and smaller updates in version 4.0 of the Screaming Frog SEO Spider, which include the following –
- Improved performance for users using large regex’s in the custom filter & fixed a bug not being able to resume crawls with these quickly.
- Fixed an issue reported by Kev Strong, where the SEO Spider was unable to crawl urls with an underscore in the hostname.
- Fixed X-Robots-Tags header to be case insensitive, as reported by Merlinox.
- Fixed a URL encoding bug.
- Fixed a bug with displaying HTML content length as string length, rather than length in bytes.
- Fixed a bug where manual entry in list mode doesn’t work if a file upload has happened previously.
- Fixed a crash when opening the SEO Spider in SERP mode and hovering over bar graph which should then display a tooltip.
Small Update – Version 4.1 Released 16th July 2015
We have just released a small update to version 4.1 of the Screaming Frog SEO Spider. There’s a new ‘GA Not Matched’ report in this release, as well as some bug fixes. This release includes the following –
GA Not Matched Report
We have released a new ‘GA Not Matched’ report, which you can find under the ‘reports’ menu in the top level navigation.
Data within this report is only available when you’ve connected to the Google Analytics API and collected data for a crawl. It essentially provides a list of all URLs collected from the GA API, that were not matched against the URLs discovered within the crawl.
This report can include anything that GA returns, such as pages in a shopping cart, or logged in areas. Hence, often the most useful data for SEOs is returned by querying the landing page path dimension and ‘organic traffic’ segment. This can then help identify –
- Orphan Pages – These are pages that are not linked to internally on the website, but do exist. These might just be old pages, those missed in an old site migration or pages just found externally (via external links, or referring sites). This report allows you to browse through the list and see which are relevant and potentially upload via list mode.
- Errors – The report can include 404 errors, which sometimes include the referring website within the URL as well (you will need the ‘all traffic’ segment for these). This can be useful for chasing up websites to correct external links, or just 301 redirecting the URL which errors, to the correct page! This report can also include URLs which might be canonicalised or blocked by robots.txt, but are actually still indexed and delivering some traffic.
- GA URL Matching Problems – If data isn’t matching against URLs in a crawl, you can check to see what URLs are being returned via the GA API. This might highlight any issues with the particular Google Analytics view, such as filters on URLs, such as ‘extended URL’ hacks etc. For the SEO Spider to return data against URLs in the crawl, the URLs need to match up. So changing to a ‘raw’ GA view, which hasn’t been touched in anyway, might help.
Other bug fixes in this release include the following –
- Fixed a couple of crashes in the custom extraction feature.
- Fixed an issue where GA requests weren’t going through the configured proxy.
- Fixed a bug with URL length, which was being incorrectly reported.
- We changed the GA range to be 30 days back from yesterday to match GA by default.
I believe that’s everything for now and please do let us know if you have any problems or spot any bugs via our support. Thanks again to everyone for all their support and feedback as usual.
Now go and download version 4.1 of the SEO Spider!