Screaming Frog SEO Spider Update – Version 5.0
Let’s get straight to it. Version 5.0 includes the following new features:
1) Google Search Analytics Integration
You can now connect to the Google Search Analytics API and pull in impression, click, CTR and average position data from your Search Console profile. Alongside our Google Analytics integration, this should be valuable for Panda and content audits.
We were part of the Search Analytics beta, so we have had this internally for some time, but we delayed the release a little while we finished off a couple of the other new features detailed below, to make for a larger release.
For those already familiar with our Google Analytics integration, the set-up is virtually the same. You just need to give permission to our app to access data under ‘Configuration > API Access > Google Search Console’ –
The Search Analytics API doesn’t provide us with the account name in the same way as the Analytics integration, so once connected it will appear as ‘New Account’, which you can rename manually for now.
You can then select the relevant site profile, date range, device results (desktop, tablet or mobile) and country filter. Similar again to our GA integration, we have some common URL matching scenarios covered, such as matching trailing and non trailing slash URLs and case sensitivity.
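To illustrate the kind of URL matching described above, here is a minimal Python sketch. It assumes matching works by comparing a normalised key that ignores letter case and a trailing slash. The function names `match_key` and `attach_gsc_data` and the example URLs are hypothetical, and this is not the SEO Spider's actual implementation.

```python
from urllib.parse import urlsplit, urlunsplit


def match_key(url):
    """Build a comparison key that ignores letter case and a trailing slash.

    Assumption: this approximates the matching scenarios described above;
    the SEO Spider's real matching rules may differ.
    """
    parts = urlsplit(url)
    # Treat "/blog/" and "/blog" as the same path; keep "/" for the root.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path.lower(), parts.query, "")
    )


def attach_gsc_data(crawled_urls, gsc_rows):
    """Map each crawled URL to the API metrics whose key matches, else None."""
    by_key = {match_key(url): metrics for url, metrics in gsc_rows.items()}
    return {url: by_key.get(match_key(url)) for url in crawled_urls}


# Hypothetical crawl URLs and API rows, for illustration only.
crawled = ["https://example.com/blog", "https://example.com/about/"]
gsc_rows = {"https://example.com/Blog/": {"clicks": 12, "impressions": 340}}
matched = attach_gsc_data(crawled, gsc_rows)
```

Comparing normalised keys rather than raw strings means a URL reported by the API as `/Blog/` can still be attached to a crawled `/blog`.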
When you hit ‘Start’ and the API progress bar has reached 100%, data will appear in real time during the crawl under the ‘Search Console’ tab, and dynamically within columns at the far right in the ‘Internal’ tab if you’d like to export all data together.
There are currently a couple of filters: ‘Clicks Above 0’, for when a URL has at least a single click, and ‘No GSC Data’, for when the Google Search Analytics API did not return any data for the URL.
In the example above, we can see the URLs appearing under the ‘No GSC Data’ filter are all author pages, which are actually ‘noindex’, so this is as expected. Remember, you might see URLs appear here which are ‘noindex’ or ‘canonicalised’, unless you have ‘respect noindex‘ and ‘respect canonicals‘ ticked in the advanced configuration tab.
The API is currently limited to 5k rows of data, which we hope Google will increase over time. We plan to extend our integration further as well, but at the moment the Search Console API is fairly limited.
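As a rough illustration of the options above (date range, device, country and the 5k row limit), the sketch below builds a request body in the shape used by the Search Analytics query method. It only constructs the body and sends nothing. The helper name `build_query_body`, the example dates and the choice to group by page are assumptions made for illustration, and this is not necessarily how the SEO Spider builds its own requests.

```python
import json


def build_query_body(start_date, end_date, device=None, country=None, row_limit=5000):
    """Build a Search Analytics query body grouped by page.

    Assumptions: dates are YYYY-MM-DD strings, device is DESKTOP, MOBILE or
    TABLET, and country is a three-letter country code.
    """
    filters = []
    if device:
        filters.append({"dimension": "device", "operator": "equals", "expression": device})
    if country:
        filters.append({"dimension": "country", "operator": "equals", "expression": country})

    body = {
        "startDate": start_date,
        "endDate": end_date,
        "dimensions": ["page"],
        "rowLimit": row_limit,  # 5,000 rows is the limit mentioned above
    }
    if filters:
        body["dimensionFilterGroups"] = [{"filters": filters}]
    return body


# Hypothetical example: mobile results for one country over one month.
body = build_query_body("2015-08-01", "2015-08-31", device="MOBILE", country="gbr")
print(json.dumps(body, indent=2))
```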
2) View & Audit URLs Blocked By Robots.txt
You can now view URLs disallowed by the robots.txt protocol during a crawl.
Disallowed URLs will appear with a ‘status’ of ‘Blocked by Robots.txt’, and there’s a new ‘Blocked by Robots.txt’ filter under the ‘Response Codes’ tab, where these can be viewed efficiently.
The ‘Blocked by Robots.txt’ filter also displays a ‘Matched Robots.txt Line’ column, which provides the line number and disallow path of the robots.txt entry that’s excluding each URL. This should make auditing robots.txt files simple!
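To show what a ‘matched line’ lookup involves, here is a deliberately simplified Python sketch that returns the line number and disallow path of the longest matching rule. It assumes plain prefix matching only: it ignores user-agent groups, ‘Allow’ rules and wildcards, so it is not a full robots.txt parser and not the SEO Spider's own logic. The function name `matched_disallow_line` is hypothetical.

```python
def matched_disallow_line(robots_txt, path):
    """Return (line_number, disallow_path) for the longest matching rule, or None.

    Simplifying assumptions: prefix matching only, no user-agent grouping,
    no Allow rules and no wildcard support.
    """
    best = None
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, separator, value = line.partition(":")
        if not separator or field.strip().lower() != "disallow":
            continue
        rule = value.strip()
        # An empty Disallow value blocks nothing, so skip it.
        if rule and path.startswith(rule):
            if best is None or len(rule) > len(best[1]):
                best = (number, rule)
    return best


# Hypothetical robots.txt content, for illustration only.
robots_txt = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/ # scratch space\n"
result = matched_disallow_line(robots_txt, "/private/report.html")
```

Keeping the line number alongside the rule is what makes it quick to trace a blocked URL back to the entry responsible.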
Historically the SEO Spider hasn’t shown URLs that are disallowed by robots.txt in the interface (they were only available via the logs). I always felt that it wasn’t required as users should know already what URLs are being blocked, and whether robots.txt should be ignored in the configuration.
However, there are plenty of scenarios where using robots.txt to control crawling and understanding quickly what URLs are blocked by robots.txt is valuable, and it’s something that has been requested by users over the years. We have therefore introduced it as an optional configuration, for both internal and external URLs in a crawl. If you’d prefer to not see URLs blocked by robots.txt in the crawl, then simply untick the relevant boxes.
URLs which are linked to internally (or externally), but are blocked by robots.txt, can obviously accrue PageRank, be indexed and appear in search. Google just can’t crawl the content of the page itself, or see the outlinks of the URL to pass the PageRank onwards. Therefore there is an argument that they can act as a bit of a dead end, so I’d recommend reviewing just how many are being disallowed, how well linked they are, and their depth, for example.
3) GA & GSC Not Matched Report
The ‘GA Not Matched’ report has been replaced with the new ‘GA & GSC Not Matched Report’, which now provides consolidated information on URLs that were discovered via the Google Search Analytics API or the Google Analytics API, but were not found in the crawl.
This report can be found under ‘reports’ in the top level menu and will only populate when you have connected to an API and the crawl has finished.
There’s a new ‘source’ column next to each URL, which details the API(s) through which it was discovered (sometimes this can be both GA and GSC) when it did not match any URL found within the crawl.
You can see in the example screenshot above from our own website, that there are some URLs with mistakes, a few orphan pages and URLs with hash fragments, which can show as quick links within meta descriptions (and hence why their source is GSC rather than GA).
I discussed how this data can be used in more detail within the version 4.1 release notes and it’s a real hidden gem, as it can help identify orphan pages, other errors, as well as just matching problems between the crawl and API(s) to investigate.
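The idea behind the report can be sketched in a few lines of Python: take every URL returned by either API, remove those found in the crawl, and record which API(s) each remaining URL came from. The function name `not_matched_report`, the ‘GA & GSC’ label format and the example URLs are assumptions for illustration. The sketch compares URLs as exact strings and is not the SEO Spider's actual implementation.

```python
def not_matched_report(crawled_urls, ga_urls, gsc_urls):
    """Return (url, source) rows for API URLs that were not found in the crawl."""
    crawled = set(crawled_urls)
    ga, gsc = set(ga_urls), set(gsc_urls)
    rows = []
    for url in sorted((ga | gsc) - crawled):
        # A URL can be reported by one API or by both.
        sources = [name for name, urls in (("GA", ga), ("GSC", gsc)) if url in urls]
        rows.append((url, " & ".join(sources)))
    return rows


# Hypothetical data, for illustration only.
report = not_matched_report(
    crawled_urls=["https://example.com/", "https://example.com/blog/"],
    ga_urls=["https://example.com/", "https://example.com/old-page/"],
    gsc_urls=["https://example.com/old-page/", "https://example.com/blog/#comments"],
)
```

Here the hash-fragment URL would be listed with GSC as its only source, while the orphaned page reported by both APIs would be listed with both.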
4) Configurable Accept-Language Header
Google introduced locale-aware crawl configurations earlier this year for pages believed to adapt the content they serve based on the request’s language and perceived location.
This essentially means Googlebot can crawl from different IP addresses around the world and with an Accept-Language HTTP header in the request. Hence, like Googlebot, there are scenarios where you may wish to supply this header to crawl locale-adaptive content, with various language and region pairs. You can already use the proxy configuration to change your IP as well.
You can find the new ‘Accept-Language’ configuration under ‘Configuration > HTTP Header > Accept-Language’.
We have some common presets covered, but the combinations are huge, so there is a custom option available which you can just set to any value required.
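For anyone who wants to see what such a header looks like outside the tool, the Python sketch below builds an Accept-Language value from a list of language and region tags and attaches it to a request object without sending it. The helper names `accept_language_value` and `build_request`, the example tags and the descending weights are assumptions for illustration. They are not the SEO Spider's presets.

```python
from urllib.request import Request


def accept_language_value(*language_tags):
    """Join language tags into an Accept-Language value with descending q-values."""
    parts = []
    for index, tag in enumerate(language_tags):
        if index == 0:
            parts.append(tag)  # the first tag keeps the implicit weight of 1.0
        else:
            weight = max(0.1, 1.0 - 0.1 * index)
            parts.append(f"{tag};q={weight:.1f}")
    return ",".join(parts)


def build_request(url, *language_tags):
    """Create (but do not send) a request carrying the Accept-Language header."""
    return Request(url, headers={"Accept-Language": accept_language_value(*language_tags)})


# Hypothetical example: prefer German (Germany), then German, then English.
request = build_request("https://example.com/", "de-DE", "de", "en")
# urllib stores header names capitalised, hence "Accept-language" here.
print(request.get_header("Accept-language"))
```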
Smaller Updates & Fixes
Those are the main features of our latest release, which we hope you find useful. Other bug fixes and updates in this release include the following –
- The Analytics and Search Console tabs have been updated to allow URLs blocked by robots.txt to appear where we believe them to be HTML, based upon file type.
- The maximum number of Google Analytics metrics you can collect from the API has been increased from 20 to 30. Google restrict the API to 10 metrics for each query, so if you select more than 10 metrics (or multiple dimensions), then we will make more queries (and it may take a little longer to receive the data).
- With the introduction of the new ‘Accept-Language’ configuration, the ‘User-Agent’ configuration is now under ‘Configuration > HTTP Header > User-Agent’.
- We added the ‘MJ12Bot’ to our list of preconfigured user-agents after a chat with our friends at Majestic.
- Fixed a crash in XPath custom extraction.
- Fixed a crash on start up with Windows Look & Feel and JRE 8 update 60.
- Fixed a bug with character encoding.
- Fixed an issue with Excel file exports, which wrote numbers with decimal places as strings, rather than numbers.
- Fixed a bug with Google Analytics integration where the use of hostname in some queries was causing ‘Selected dimensions and metrics cannot be queried together’ errors.
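The metric limit mentioned in the list above (up to 30 selected metrics, with 10 allowed per query) implies splitting the selection into several queries. A minimal Python sketch of that batching, assuming simple fixed-size chunks, is shown here. The function name `batch_metrics` and the placeholder metric names are hypothetical, and the sketch does not cover the multiple-dimension case.

```python
def batch_metrics(metrics, per_query=10):
    """Split the selected metrics into groups of at most `per_query` items."""
    return [metrics[start:start + per_query] for start in range(0, len(metrics), per_query)]


# Placeholder metric names, for illustration only.
selected = [f"metric_{number}" for number in range(1, 31)]
batches = batch_metrics(selected)
print(len(batches), "queries needed for", len(selected), "metrics")
```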
Small Update – Version 5.1 Released 22nd October 2015
We released a small update to version 5.1 of the SEO Spider, which just includes some bug fixes and tweaks as below.
- Fixed issues with filter totals and Excel row numbers.
- Fixed a couple of errors with custom extraction.
- Fixed robots.txt total numbers within the overview section.
- Fixed a crash when sorting.
That’s everything for this release!
Thanks to everyone for all the suggestions and feedback for our last update, and just in general. If you spot any bugs or issues in this release, please do just drop us a note via support.
Now go and download version 5.0 of the SEO Spider!