Posted 1 July, 2020 by in Screaming Frog SEO Spider

Screaming Frog SEO Spider Update – Version 13.0

We are excited to announce the release of Screaming Frog SEO Spider version 13.0, codenamed internally as ‘Lockdown’.

We’ve been busy developing exciting new features, and despite the obvious change in priorities for everyone right now, we want to continue to release updates as normal that help users in the work they do.

Let’s take a look at what’s new.

1) Near Duplicate Content

You can now discover near-duplicate pages, not just exact duplicates. We’ve introduced a new ‘Content‘ tab, which includes filters for both ‘Near Duplicates’ and ‘Exact Duplicates’.

Near Duplicates

While there isn’t a duplicate content penalty, having similar pages can cause cannibalisation issues and crawling and indexing inefficiencies. Very similar pages should be minimised and high similarity could be a sign of low-quality pages, which haven’t received much love – or just shouldn’t be separate pages in the first place.

For ‘Near Duplicates’, the SEO Spider will show you the closest similarity match %, as well as the number of near-duplicates for each URL. The ‘Exact Duplicates’ filter uses the same algorithmic check for identifying identical pages that was previously named ‘Duplicate’ under the ‘URL’ tab.

The new ‘Near Duplicates’ detection uses a minhash algorithm, which allows you to configure a near-duplicate similarity threshold, which is set at 90% by default. This can be configured via ‘Config > Content > Duplicates’.

Near Duplicates

Semantic elements such as the nav and footer are automatically excluded from the content analysis, but you can refine it further by excluding or including HTML elements, classes and IDs. This can help focus the analysis on the main content area, avoiding known boilerplate text. It can also be used to provide a more accurate word count.

Content Area Configuration

Near duplicates requires post crawl analysis to be populated, and more detail on the duplicates can be seen in the new ‘Duplicate Details’ lower tab. This displays every near-duplicate URL identified, and their similarity match.

Duplicate Details Tab

Clicking on a ‘Near Duplicate Address’ in the ‘Duplicate Details’ tab will display the near duplicate content discovered between the pages, and perform a diff to highlight the differences.

Duplicate Details Tab

The near-duplicate content threshold and content area used in the analysis can both be updated post-crawl, and crawl analysis can be re-run to refine the results, without the need for re-crawling.

The ‘Content’ tab also includes a ‘Low Content Pages’ filter, which identifies pages with less than 200 words using the improved word count. This can be adjusted to your preferences under ‘Config > Spider > Preferences’ as there obviously isn’t a one-size-fits-all measure for minimum word count in SEO.

Read our ‘How To Check For Duplicate Content‘ tutorial for more.

2) Spelling & Grammar

If you’ve found yourself with extra time under lockdown, then we know just the way you can spend it (sorry).

You’re now also able to perform a spelling and grammar check during a crawl. The new ‘Content’ tab has filters for ‘Spelling Errors’ and ‘Grammar Errors’ and displays counts for each page crawled.

Spelling & Grammar Checks

You can enable spelling and grammar checks via ‘Config > Content > Spelling & Grammar‘.

While this is a little different from our usual very ‘SEO-focused’ features, a large part of our roles are about improving websites for users. Google’s own search quality evaluator guidelines outline spelling and grammar errors numerous times as one of the characteristics of low-quality pages (if you need convincing!).

The lower window ‘Spelling & Grammar Details’ tab shows you the error, type (spelling or grammar), detail, and provides a suggestion to correct the issue.

The right-hand-side of the details tab also shows you a visual of the text from the page and errors identified.

Spelling & Gtammar Details

The right-hand pane ‘Spelling & Grammar’ tab displays the top 100 unique errors discovered and the number of URLs it affects. This can be helpful for finding errors across templates, and for building your dictionary or ignore list.

Top 100 Errors Spelling & Grammar

The new spelling and grammar feature will auto-identify the language used on a page (via the HTML language attribute), but also allow you to manually select language where required. It supports 39 languages, including English (UK, USA, Aus etc), German, French, Dutch, Spanish, Italian, Danish, Swedish, Japanese, Russian, Arabic and more.

You’re able to ignore words for a crawl, add to a dictionary (which is remembered across crawls), disable grammar rules and exclude or include content in specific HTML elements, classes or IDs for spelling and grammar checks.

Spelling & Grammar Config

You’re also able to ‘update’ the spelling and grammar check to reflect changes to your dictionary, ignore list or grammar rules without re-crawling the URLs.

Re-run Spelling & Grammar Checker

As you would expect, you can export all the data via the ‘Bulk Export > Content’ menu.

Bulk Export Spelling & Grammar Errors

Please don’t send us any ‘broken spelling/grammar’ link building emails. Check out our ‘Spell & Grammar Check Your Website‘ tutorial.

3) Improved Link Data – Link Position, Path Type & Target

Some of our most requested features have been around link data. You want more, to be able to make better decisions. We’ve listened, and the SEO Spider now records some new attributes for every link.

Link Position

You’re now able to see the ‘link position’ of every link in a crawl – such as whether it’s in the navigation, content of the page, sidebar or footer for example. The classification is performed by using each link’s ‘link path’ (as an XPath) and known semantic substrings, which can be seen in the ‘inlinks’ and ‘outlinks’ tabs.

Link Position

If your website uses semantic HTML5 elements (or well-named non-semantic elements, such as div id=”nav”), the SEO Spider will be able to automatically determine different parts of a web page and the links within them.

HTML5 Semantic Elements

But not every website is built in this way, so you’re able to configure the link position classification under ‘Config > Custom > Link Positions‘. This allows you to use a substring of the link path, to classify it as you wish.

For example, we have mobile menu links outside the nav element that are determined to be in ‘content’ links. This is incorrect, as they are just an additional sitewide navigation on mobile.

Link Position Classification

The ‘mobile-menu__dropdown’ class name (which is in the link path as shown above) can be used to define its correct link position using the Link Positions feature.

Custom Link Positions

These links will then be correctly attributed as a sitewide navigation link.

Mobile Menu Links classified

This can help identify ‘inlinks’ to a page that are only from in-body content, for example, ignoring any links in the main navigation, or footer for better internal link analysis.

Path Type

The ‘path type’ of a link is also recorded (absolute, path-relative, protocol-relative or root-relative), which can be seen in inlinks, outlinks and all bulk exports.

Link Path Type

This can help identify links which should be absolute, as there are some integrity, security and performance issues with relative linking under some circumstances.

Target Attribute

Additionally, we now show the ‘target’ attribute for every link, to help identify links which use ‘_blank’ to open in a new tab.

Target Attribute

This is helpful when analysing usability, but also performance and security – which brings us onto the next feature.

4) Security Checks

The ‘Protocol’ tab has been renamed to ‘Security‘ and more up to date security-related checks and filters have been introduced.

While the SEO Spider was already able to identify HTTP URLs, mixed content and other insecure elements, exposing them within filters helps you spot them more easily.

Security Tab

You’re able to quickly find mixed content, issues with insecure forms, unsafe cross-origin links, protocol-relative resource links, missing security headers and more.

The old insecure content report remains as well, as this checks all elements (canonicals, hreflang etc) for insecure elements and is helpful for HTTPS migrations.

The new security checks introduced are focused on the most common issues related to SEO, web performance and security, but this functionality might be extended to cover additional security checks based upon user feedback.

5) Improved UX Bits

We’ve found some new users could get confused between the ‘Enter URL to spider’ bar at the top, and the ‘search’ bar on the side. The size of the ‘search’ bar had grown, and the main URL bar was possibly a little too subtle.

Old URL Box

So we have adjusted sizing, colour, text and included an icon to make it clearer where to put your URL.

New URL Box

If that doesn’t work, then we’ve got another concept ready and waiting for trial.

URL Box Improvements

The ‘Image Details’ tab now displays a preview of the image, alongside its associated alt text. This makes image auditing much easier!

Image Details Preview

You can highlight cells in the higher and lower windows, and the SEO Spider will display a ‘Selected Cells’ count.

selected cells count

The lower windows now have filters and a search, to help find URLs and data more efficiently.

Lower Tab Search & Filters

Site visualisations now have an improved zoom, and the tree graph nodes spacing can be much closer together to view a site in its entirety. So pretty.

Crawl Tree Graph

Oh, and in the ‘View Source’ tab, you can now click ‘Show Differences’ and it will perform a diff between the raw and rendered HTML.

View Source Diff

Other Updates

Version 13.0 also includes a number of smaller updates and bug fixes, outlined below.

  • The PageSpeed Insights API integration has been updated with the new Core Web Vitals metrics (Largest Contentful Paint, First Input Delay and Cumulative Layout Shift). ‘Total Blocking Time’ Lighthouse metric and ‘Remove Unused JavaScript’ opportunity are also now available. Additionally, we’ve introduced a new ‘JavaScript Coverage Summary’ report under ‘Reports > PageSpeed’, which highlights how much of each JavaScript file is unused across a crawl and the potential savings.
  • Following the Log File Analyser version 4.0, the SEO Spider has been updated to Java 11. This means it can only be used on 64-bit machines.
  • iFrames can now be stored and crawled (under ‘Config > Spider > Crawl’).
  • Fragments are no longer crawled by default in JavaScript rendering mode. There’s a new ‘Crawl Fragment Identifiers’ configuration under ‘Config > Spider > Advanced’ that allows you to crawl URLs with fragments in any rendering mode.
  • A tonne of Google features for structured data validation have been updated. We’ve added support for COVID-19 Announcements and Image Licence features. Occupation has been renamed to Estimated Salary and two deprecated features, Place Action and Social Profile, have been removed.
  • All Hreflang ‘confirmation links’ named filters have been updated to ‘return links’, as this seems to be the common naming used by Google (and who are we to argue?). Check out our How To Audit Hreflang guide for more detail.
  • Two ‘AMP’ filters have been updated, ‘Non-Confirming Canonical’ has been renamed to ‘Missing Non-AMP Return Link’, and ‘Missing Non-AMP Canonical’ has been renamed to ‘Missing Canonical to Non-AMP’ to make them as clear as possible. Check out our How To Audit & validate AMP guide for more detail.
  • The ‘Memory’ configuration has been renamed to ‘Memory Allocation’, while ‘Storage’ has been renamed to ‘Storage Mode’ to avoid them getting mixed up. These are both available under ‘Config > System’.
  • Custom Search results now get appended to the Internal tab when used.
  • The Forms Based Authentication browser now shows you the URL you’re viewing to make it easier to spot sneaky redirects.
  • Deprecated APIs have been removed for the Ahrefs integration.

That’s everything. If you experience any problems, then please do just let us know via our support and we’ll help as quickly as possible.

Thank you to everyone for all their feature requests, feedback, and bug reports. Apologies for anyone disappointed we didn’t get to the feature they wanted this time. We prioritise based upon user feedback (and a little internal steer) and we hope to get to them all eventually.

Now, go and download version 13.0 of the Screaming Frog SEO Spider and let us know what you think!

Small Update – Version 13.1 Released 15th July 2020

We have just released a small update to version 13.1 of the SEO Spider. This release is mainly bug fixes and small improvements –

  • We’ve introduced two new reports for Google Rich Result features discovered in a crawl under ‘Reports > Structured Data’. There’s a summary of features and number of URLs they affect, and a granular export of every rich result feature detected.
  • Fix issue preventing start-up running on macOS Big Sur Beta
  • Fix issue with users unable to open .dmg on macOS Sierra (10.12).
  • Fix issue with Windows users not being able to run when they have Java 8 installed.
  • Fix TLS handshake issue connecting to some GoDaddy websites using Windows.
  • Fix crash in PSI.
  • Fix crash exporting the Overview Report.
  • Fix scaling issues on Windows using multiple monitors, different scaling factors etc.
  • Fix encoding issues around URLs with Arabic text.
  • Fix issue when amending the ‘Content Area’.
  • Fix several crashes running Spelling & Grammar.
  • Fix several issues around custom extraction and XPaths.
  • Fix sitemap export display issue using Turkish locale.

Small Update – Version 13.2 Released 4th August 2020

We have just released a small update to version 13.2 of the SEO Spider. This release is mainly bug fixes and small improvements –

  • We first released custom search back in 2011 and it was in need of an upgrade. So we’ve updated functionality to allow you to search within specific elements, entire tracking tags and more. Check out our custom search tutorial.
  • Sped up near duplicate crawl analysis.
  • Google Rich Results Features Summary export has been ordered by number of URLs.
  • Fix bug with Near Duplicates Filter not being populated when importing a .seospider crawl.
  • Fix several crashes in the UI.
  • Fix PSI CrUX data incorrectly labelled as sec.
  • Fix spell checker incorrectly checking some script content.
  • Fix crash showing near duplicates details panel.
  • Fix issue preventing users with dual stack networks to crawl on windows.
  • Fix crash using Wacom tablet on Windows 10.
  • Fix spellchecker filters missing when reloading a crawl.
  • Fix crash on macOS around multiple screens.
  • Fix crash viewing gif in the image details tab.
  • Fix crash canceling during database crawl load.