Posted 19 September, 2018 in Screaming Frog SEO Spider

Screaming Frog SEO Spider Update – Version 10.0

We are delighted to announce the release of Screaming Frog SEO Spider version 10.0, codenamed internally as ‘Liger’.

In our last release, we announced an extremely powerful hybrid storage engine, and in this update, we have lots of very exciting new features driven entirely by user requests and feedback. So, let’s get straight to them.

1) Scheduling

You can now schedule crawls to run automatically within the SEO Spider, as a one-off, or at chosen intervals.

Screaming Frog Scheduler

You’re able to pre-select the mode (spider, or list), saved configuration, as well as APIs (Google Analytics, Search Console, Majestic, Ahrefs, Moz) to pull in any data for the scheduled crawl.

scheduling start options

You can also automatically save the crawl file and export any of the tabs, filters, bulk exports, reports or XML Sitemaps to a chosen location.

scheduling export options

This should be super useful for anyone that runs regular crawls, has clients that only allow crawling at certain less-than-convenient ‘but, I’ll be in bed!’ off-peak times, uses crawl data for their own automatic reporting, or has a developer that needs a broken links report sent to them every Tuesday by 7 am.

The keen-eyed among you may have noticed that the SEO Spider will run in headless mode (meaning without an interface) when scheduled to export data – which leads us to our next point.

2) Full Command Line Interface & --Headless Mode

You’re now able to operate the SEO Spider entirely via command line. This includes launching, full configuration, saving and exporting of almost any data and reporting.

It behaves like a typical console application, and you can use --help to view the full arguments available.

CLI

You can read the full list of commands that can be supplied and how to use the command line in our updated user guide. This also allows running the SEO Spider completely headless, so you won’t even need to look at the user interface if that’s your preference (how rude!).
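As a rough sketch of how a headless crawl might be scripted from Python, the snippet below only assembles an argument list and prints it, so nothing is launched. The post itself names only the help and headless options; the other flags (`--crawl`, `--save-crawl`, `--output-folder`, `--export-tabs`), the executable name and the ‘Internal:All’ tab name are assumptions recalled from the user guide, and should be checked against the output of --help on your own install.

```python
import shlex


def build_crawl_command(url, output_folder, export_tabs=("Internal:All",),
                        executable="screamingfrogseospider"):
    """Assemble an argument list for a headless crawl (nothing is executed).

    Every flag except --headless is an assumption to verify with --help,
    and the executable name differs between operating systems.
    """
    args = [executable, "--crawl", url, "--headless", "--save-crawl",
            "--output-folder", output_folder]
    if export_tabs:
        # Tab and filter names are passed as one comma-separated argument.
        args += ["--export-tabs", ",".join(export_tabs)]
    return args


command = build_crawl_command("https://www.example.com/", "/tmp/crawls")
print(shlex.join(command))
```

Keeping the command as a list (rather than one long string) means it could be handed to `subprocess.run` without any shell quoting issues once the flags have been confirmed.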

We believe this can be an extremely powerful feature, and we’re excited about the new and unique ways users will utilise this ability within their own tech stacks.

3) Indexability & Indexability Status

This is not the third biggest feature in this release, but it’s important to understand the concept of indexability we have introduced into the SEO Spider, as it’s integrated into many old and new features and data.

Every URL is now classified as either ‘Indexable’ or ‘Non-Indexable’.

Indexability & Indexability Status

These two phrases are now commonplace within SEO, but they don’t have an exact definition. For the SEO Spider, an ‘Indexable’ URL means a page that can be crawled, responds with a ‘200’ status code and is permitted to be indexed.

This might differ a little from the search engines, which will index URLs which can’t be crawled and content that can’t be seen (such as those blocked by robots.txt) if they have links pointing to them. The reason for this is simplicity: it helps to bucket and organise URLs into two distinct groups of interest.

Each URL will also have an indexability status associated with it for quick reference. This provides a reason why a URL is ‘non-indexable’, for example, if it’s a ‘Client Error’, ‘Blocked by Robots.txt’, ‘noindex’, ‘Canonicalised’ or something else (and perhaps a combination of those).

This was introduced to make auditing more efficient. It makes it easier when you export data from the internal tab, to quickly identify which URLs are canonicalised for example, rather than having to run a formula in a spreadsheet. It makes it easier at a glance to review whether a URL is indexable when reviewing page titles, rather than scanning columns for canonicals, directives etc. It also allows the SEO Spider to use a single filter, or two columns to communicate a potential issue, rather than six or seven.
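The classification described above can be pictured with a small sketch. This is a simplified illustration of the definition in this section, not the SEO Spider’s actual logic; the function name, its parameters and the ‘Non-200 Response’ label are hypothetical, while the other status labels are the ones quoted above.

```python
def classify(status_code, blocked_by_robots=False, noindex=False,
             canonical_to_other=False):
    """Return (indexability, indexability_status) for a single URL.

    A URL counts as indexable only if it can be crawled, responds with 200
    and carries nothing that prevents indexing. Several reasons can combine.
    """
    reasons = []
    if blocked_by_robots:
        # A blocked URL is never fetched, so its status code is not inspected.
        reasons.append("Blocked by Robots.txt")
    elif 400 <= status_code < 500:
        reasons.append("Client Error")
    elif status_code != 200:
        reasons.append("Non-200 Response")  # hypothetical catch-all label
    if noindex:
        reasons.append("noindex")
    if canonical_to_other:
        reasons.append("Canonicalised")
    if reasons:
        return "Non-Indexable", ", ".join(reasons)
    return "Indexable", ""


print(classify(200))
print(classify(404))
print(classify(200, noindex=True, canonical_to_other=True))
```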

4) XML Sitemap Crawl Integration

It’s always been possible to crawl XML Sitemaps directly within the SEO Spider (in list mode), however, you’re now able to crawl and integrate them as part of a site crawl.

You can select to crawl XML Sitemaps under ‘Configuration > Spider’, and the SEO Spider will auto-discover them from a robots.txt entry, or the location can be supplied.

integrated XML Sitemap crawling

The new Sitemaps tab and filters allow you to quickly analyse common issues with your XML Sitemap, such as URLs not in the sitemap, orphan pages, non-indexable URLs and more.

non-indexable urls in sitemap

You can also now supply the XML Sitemap location into the URL bar at the top, and the SEO Spider will crawl that directly, too (instead of switching to list mode).
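For readers curious how auto-discovery from robots.txt can work in principle, here is a minimal sketch that pulls `Sitemap:` entries out of a robots.txt body. It is an illustration only, not the SEO Spider’s code; the function name and the example.com URLs are made up for the example.

```python
def sitemaps_from_robots(robots_txt):
    """Return sitemap URLs declared in a robots.txt body, in order, de-duplicated."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments (this would also cut a URL containing '#').
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        # The field name is matched case-insensitively.
        if sep and field.strip().lower() == "sitemap":
            url = value.strip()
            if url and url not in found:
                found.append(url)
    return found


example = """User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
sitemap: https://www.example.com/news-sitemap.xml  # second file
"""
print(sitemaps_from_robots(example))
```

From Python 3.8, the standard library’s `urllib.robotparser.RobotFileParser.site_maps()` offers similar behaviour if you would rather not parse the file by hand.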

5) Internal Link Score

A useful way to evaluate and improve internal linking is to calculate internal PageRank of URLs, to help get a clearer understanding about which pages might be seen as more authoritative by the search engines.

The SEO Spider already reports on a number of useful metrics to analyse internal linking, such as crawl depth, the number of inlinks and outlinks, the number of unique inlinks and outlinks, and the percentage of overall URLs that link to a particular URL. To aid this further, we have now introduced an advanced ‘link score’ metric, which calculates the relative value of a page based upon its internal links.

Internal Link Score

This uses a relative 0-100 point scale from least to most value for simplicity, which allows you to determine where internal linking might be improved.

The link score metric algorithm takes into consideration redirects, canonicals, nofollow and much more, which we will go into in more detail in another post.

This is a relative mathematical calculation, which can only be performed at the end of a crawl when all URLs are known. Previously, every calculation within the SEO Spider has been performed at run-time during a crawl, which leads us on to the next feature.
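To give a feel for what an internal PageRank style calculation on a 0-100 scale involves, here is a deliberately simplified sketch. It is plain PageRank by power iteration, scaled so the strongest page scores 100; it is not the SEO Spider’s Link Score algorithm, and it ignores redirects, canonicals and nofollow entirely. The function name, damping factor and iteration count are assumptions chosen for illustration.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Score pages 0-100 from internal links using basic PageRank.

    links maps each URL to the list of URLs it links to. The whole link
    graph is needed up front, which is why this kind of metric can only
    be calculated once a crawl has finished.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if not pages:
        return {}
    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Pages with no outlinks spread their value across every page.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    top = max(rank.values())
    # Relative scale: the highest-ranked page becomes 100.
    return {page: round(100 * value / top) for page, value in rank.items()}


site = {"/": ["/a", "/b"], "/a": ["/"], "/b": ["/"], "/c": ["/"]}
print(link_scores(site))
```

In this toy site the homepage receives links from every other page, so it ends up with the top score, while ‘/c’, which nothing links to, ends up at the bottom.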

6) Post Crawl Analysis

The SEO Spider is now able to perform further analysis at the end of a crawl (or when it’s stopped) for more data and insight. This includes the new ‘Link Score’ metric and a number of other new filters that have been introduced.

Crawl analysis can be automatically performed at the end of a crawl, or it can be run manually by the user. This can be viewed under ‘Crawl Analysis > Configure’ and the crawl analysis can be started by selecting ‘Crawl Analysis > Start’. When the analysis is running, the SEO Spider can continue to be used as normal.

crawl analysis

When the crawl analysis has finished, the empty filters which are marked with ‘Crawl Analysis Required’, will be populated with data.

Most of these items were already available via reports, but this new feature brings them into the interface to make them more visible, too.

7) Visualisations

We have a confession. We have always loved the idea of crawl visualisations, but have always had a problem with them – they were rarely actionable. They too frequently don’t help diagnose actual problems, hide data, and often don’t reflect the real world view of a crawl either. Although, they have always looked pretty, and some SEOs are able to read them like a piece of abstract art. That said, the actual concept is fun and exciting, and due to overwhelming popular demand, we went to the drawing board.

Portent originally introduced the concept of force-directed diagrams to the SEO industry, and a few providers already provide various useful site visualisations (and kudos to them), but we don’t believe any of them were perfect, and we wanted to see if we could challenge our own assumptions of their limits.

We wanted to build a better way of visually understanding a site, its architecture, internal link structure and issues. We wanted to make them scalable, and we didn’t want to have to hide data from users to make them work.

So, we have introduced two types of diagrams, and two different perspectives on viewing a site, each with their own benefits that we believe provide more actionable data, and insight.

These include two crawl visualisations, and two directory tree visualisations.

SEO visualisations

Crawl Visualisations

The force-directed crawl diagram and crawl graph visualisations are useful for analysis of internal linking, as they provide a view of how the SEO Spider has crawled the site, by shortest path to a page. Here’s how our own website can be seen with our force-directed crawl diagram.

force directed crawl diagram

Indexable pages are represented by the green nodes, the darkest, largest circle in the middle is the start URL (the homepage), and those surrounding it are the next level deep, and they get further away, smaller and lighter with increasing crawl depth (like a heatmap).
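The ‘shortest path to a page’ idea behind crawl depth can be sketched with a breadth-first search over the internal link graph. This is an illustration of the concept rather than the SEO Spider’s crawler; the function name and the tiny example site are hypothetical.

```python
from collections import deque


def crawl_depths(links, start):
    """Return the fewest clicks needed to reach each URL from the start URL.

    links maps each URL to the list of URLs it links to. Breadth-first
    search visits pages level by level, so the first time a URL is seen
    is always via a shortest path.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c", "/"], "/c": ["/d"]}
print(crawl_depths(site, "/"))
```

Each URL appears once, attached to the first page that reached it, which is what turns a tangled link graph into the tree-like shape a diagram can draw.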

One of the problems with crawl visualisations is scale. They are really memory intensive and the force-directed crawl diagram visualisations do not scale very well due to the amounts of data. The browser will start to grind to a halt at anything above 10k URLs, unless interactivity and other bells and whistles were removed, which would be a shame, as that’s part of their appeal. However, it’s sites on a larger scale that need visualisations the most, to really understand them.

So, as site architecture doesn’t start and end at the homepage, our visualisations can be viewed from any URL.

The visualisation will show up to 10k URLs in the browser, but allow you to right-click and ‘focus’ to expand on particular areas of a site to show more URLs in that section (up to another 10k URLs at a time). You can use the browser as navigation, typing in a URL directly and moving forwards and backwards with ease.

force directed crawl diagram right click focus

You can also right-click on any URL in a crawl, and open up a visualisation from that point as a visual URL explorer.

right click visualisations

When a visualisation has reached the 10k URL limit, it lets you know when a particular node has children that are being truncated (due to size limits), by colouring the nodes grey. You can then right click and ‘explore’ to see the children. This way, every URL in a crawl can be visualised.

right click focus to view truncated children

The pastel red highlights URLs that are non-indexable, which makes it quite easy to spot problematic areas of a website. There are valid reasons for non-indexable pages, but visualising their proportion and where they are can be useful in quickly identifying areas of interest to investigate further.

We also took the force-directed diagrams a few steps further, to allow a user to completely configure them visually, in size of nodes, overlap, separation, colour, link length and when to display text.

visualisation configuration

After all, they are arguably more like works of art.

Pretty force-directed crawl diagram

More significantly, you also have the ability to scale visualisations by other metrics to provide greater insight, such as unique inlinks, word count, GA Sessions, GSC Clicks, Link Score, Moz Page Authority and more.

The size and colour of nodes will scale based upon these metrics, which can help visualise many different things alongside internal linking, such as sections of a site which might have thin content.

low content using force-directed diagram

Or highest value by link score.

link score using force-directed diagram

It can be hard to quickly see pages in a force-directed diagram, as beautiful as they are. So you can also view internal linking in a more simplistic crawl tree graph, which can be configured to display left to right, or top to bottom (or bottom to top, if you’re slightly weird).

crawl tree visualisation

You can right click and ‘focus’ on particular areas of the site. You can also expand or collapse up to a particular crawl depth, and adjust the level and node spacing, to get it just right.

crawl tree visualisation focused

Like the force-directed diagrams, all the colours can also be adjusted for fun (or if you have to be boring and use brand colours).

Directory Tree Visualisations

The ‘Directory Tree’ view in the SEO Spider has been a favourite of users for a long time, and we wanted to introduce this into our visualisations.

The key differentiator is that it helps to understand a site’s URL architecture, and the way it’s organised, as opposed to internal linking of the crawl visualisations. This can be useful, as these groupings often share the same page templates, and SEO issues (but not always).

The force-directed directory tree diagram is unique to the SEO Spider. For a crawl of our site, you can see it looks very different from the previous crawl diagram, and it makes potential problems easier to spot.

force-directed directory tree diagram

Notice how the non-indexable red nodes are organised together, as they have the same template, whereas in the crawl diagram they are distributed throughout. This view often makes it easier to see patterns.

This can also be viewed in a simplistic directory tree graph format, too. These graphs are interactive and here’s a zoomed in, top-down view of a section of our website.

directory tree graph

While the SEO Spider’s visualisations don’t solve all the problems mentioned at the outset of this feature, they are a step in the right direction to making them more insightful, a truer representation of a site, and ultimately, useful.

We believe there might be a sweet spot middle ground between the crawl and directory tree visualisations, but that’s work in progress. If there are any further scale metrics you’d like to see introduced into these visualisations, then do let us know.

Anchor & Body Text Word Clouds

Due to our visualisation integration, you can also visualise all internal anchors to a URL, and the body text of a page.

inlink anchor text word cloud

These options are available via right-clicking a URL and choosing ‘Visualisations’.

8) AMP Crawling & Validation

You can now automatically extract and crawl accelerated mobile pages (AMP), analyse and validate them.

You can quickly identify various common AMP issues via the new AMP tab and filters, such as errors, missing canonicals or non-confirming links with the desktop version.

I don't like AMP :-)

The AMP Validator has also been integrated into the SEO Spider, so you can crawl and identify any validation issues at scale. This includes the exact checks from the AMP Validator, for all required HTML as per the specification, and disallowed HTML.

9) Canonicals & Pagination Tabs & Filters

Canonicals and pagination were previously included under the directives tab. However, neither is a directive and while they are useful to view in combination with each other, we felt they were deserving of their own tabs, with their own set of finely tuned filters, to help identify issues faster.

So, both have their own new tabs with updated and more granular filters. This also helps expose data that was only previously available within reports, directly into the interface. For example, the new canonicals tab now includes a ‘Non-Indexable Canonical’ filter which could only be seen previously by reviewing response codes, or viewing ‘Reports > Non-Indexable Canonicals’.

canonicals tab

Pagination is something websites get wrong an awful lot; it’s nearly at hreflang levels. So, there’s now a bunch of useful ways to filter paginated pages under the pagination tab to identify common issues, such as non-indexable paginated pages, loops, or sequence errors.

pagination tab

The more comprehensive filters should help make identifying and fixing common pagination errors much more efficient.

10) Improved Redirect & Canonical Chain Reports

The SEO Spider now reports on canonical chains and ‘mixed chains’, which can be found in the renamed ‘Redirect & Canonical Chains’ report.

redirect & canonical chains report

For example, the SEO Spider now has the ability to report on mixed chain scenarios such as, redirect to a URL which is canonicalised to another URL, which has a meta refresh to another URL, which then JavaScript redirects back to the start URL. It will identify this entire chain, and report on it.

The report has also been updated to have fixed position columns for the start URL and final URL in the chain, and it reports on the indexability and indexability status of the final URL, which makes it quicker to see whether a redirect chain ends up at a ‘noindex’ or ‘error’ page etc. The full hops in the chain are still reported as previously, but in varying columns afterwards.

This means auditing redirects is significantly more efficient, as you can quickly identify the start and end URLs, and discover the chain type, the number of redirects and the indexability of the final target URL immediately. There are also flags for chains that contain a loop, or have a temporary redirect somewhere in the chain.
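The mixed chain scenario described above can be modelled in a few lines. The sketch below is an illustration of the idea, not the SEO Spider’s implementation: the function name, the hop type labels and the returned field names are all hypothetical, and the URLs are made up.

```python
def trace_chain(start, hops):
    """Follow a chain of hops from a start URL and summarise it.

    hops maps a URL to (hop_type, target_url), where hop_type is a label
    such as 'HTTP Redirect', 'Canonical', 'Meta Refresh' or
    'JavaScript Redirect'.
    """
    path = [start]
    kinds = []
    seen = {start}
    url = start
    loop = False
    while url in hops:
        kind, target = hops[url]
        kinds.append(kind)
        path.append(target)
        if target in seen:
            # The chain has returned to a URL already visited.
            loop = True
            break
        seen.add(target)
        url = target
    if not kinds:
        chain_type = "None"
    elif all(kind == "Canonical" for kind in kinds):
        chain_type = "Canonical"
    elif "Canonical" not in kinds:
        chain_type = "Redirect"
    else:
        chain_type = "Mixed"
    return {"start": start, "final": path[-1], "hops": len(kinds),
            "chain_type": chain_type, "loop": loop, "path": path}


hops = {
    "https://example.com/a": ("HTTP Redirect", "https://example.com/b"),
    "https://example.com/b": ("Canonical", "https://example.com/c"),
    "https://example.com/c": ("Meta Refresh", "https://example.com/d"),
    "https://example.com/d": ("JavaScript Redirect", "https://example.com/a"),
}
print(trace_chain("https://example.com/a", hops))
```

Keeping the start URL, final URL, hop count and loop flag as separate fields mirrors the fixed position columns idea: the summary can be read at a glance, with the full path available afterwards.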

There simply isn’t a better tool anywhere for auditing redirects at scale, and while a feature like visualisations might receive all the hype, this is significantly more useful for technical SEOs in the trenches every single day. Please read our updated guide on auditing redirects in a site migration.

Other Updates

Version 10.0 also includes a number of smaller updates and bug fixes, outlined below.

  • You’re now able to automatically load new URLs discovered via Google Analytics and Google Search Console into a crawl. Previously, new URLs discovered were only available via the orphan pages report; this is now configurable. This option can be found under ‘API Access > GA/GSC > General’.
  • ‘Non-200 Hreflang URLs’ have now been moved into a filter under the ‘Hreflang’ tab.
  • You can disable respecting HSTS Policy under the advanced configuration (to retrieve true redirect status codes more easily, rather than an internal 307).
  • The ‘Canonical Errors’ report has been renamed to ‘Non-Indexable Canonicals’ and is available under reports in the top level menu.
  • The ‘rel=”next” and rel=”prev” Errors’ report has been adjusted to ‘Pagination’ > ‘Non-200 Pagination URLs’ and ‘Unlinked Pagination URLs’ reports.
  • Hard disk space has been reduced by around 30% for crawls in database storage mode.
  • Re-spidering of URLs in bulk on larger crawls is faster and more reliable.
  • There are new ‘bulk exports’ for Sitemaps and AMP as you would expect.
  • The main URL address bar at the top is now much wider.
  • Donut charts and right click highlighting colours have been updated.
  • There’s a new ‘Always Follow Canonicals’ configuration item for list mode auditing.
  • The 32k character limit for custom extraction has been removed.
  • rel="next" and rel="prev" are now available in the ‘Internal’ tab.
  • ‘Max Redirects To Follow’ configuration has been moved under the ‘Limits’ tab.
  • There’s now a ‘resources’ lower window tab, which includes (you guessed it), resources.
  • The Google Search Console integration website profiles list is now searchable.
  • The include and exclude configurations now have a ‘test’ tab to help test your regex before a crawl.
  • There’s a new splash screen on start-up.
  • There’s a bunch of new right click options for popular checks with other tools such as PageSpeed Insights, the Mobile Testing Tool etc.

That’s everything. If you made it this far and are still reading, thank you for caring. Thank you to everyone for the many feature requests and feedback, which have helped the SEO Spider improve so much over the past 8 years.

If you experience any problems with the new version, then please do just let us know via support and we can help.

Now, go and download version 10.0 of the Screaming Frog SEO Spider.

Small Update – Version 10.1 Released 21st September 2018

We have just released a small update to version 10.1 of the SEO Spider. This release is mainly bug fixes and small improvements –

  • Fix issue with no URLs displaying in the UI when ‘Respect Next/Prev’ is ticked.
  • Stop visualisations popping to the front after displaying pop-ups in the main UI.
  • Allow configuration dialogs to be resized for users on smaller screens.
  • Update include & exclude test tabs to show the encoded URL that the regular expressions are run against.
  • Fix a crash when accessing GA/GSC via the scheduling UI.
  • Fix crash when running crawl analysis with no results.
  • Make tree graph and force-directed diagram fonts configurable.
  • Fix issue with bold and italic buttons not resetting to default on graph config panels.

Small Update – Version 10.2 Released 3rd October 2018

We have just released a small update to version 10.2 of the SEO Spider. This release is mainly bug fixes and small improvements –

  • --headless can now be run on Ubuntu under Windows.
  • Added configuration option “Respect Self Referencing Meta Refresh” (Configuration > Spider > Advanced). Lots of websites have self-referencing meta refreshes, which can be classed as ‘non-indexable’, and this can now simply be switched off.
  • URLs added to the crawl via GA/GSC now go through URL rewriting and exclude configuration.
  • Various scheduling fixes.
  • The embedded browser now runs in a sandbox.
  • The Force-Directed Diagram directory tree now considers non-trailing slash URLs as potential directories, and doesn’t duplicate where appropriate.
  • Fixed bug with ‘Custom > Extraction’ filter missing columns when run headless.
  • Fixed issue preventing crawls saving with more than 32k of custom extraction data.
  • Fixed issue with ‘Link Score’ not being saved/restored.
  • Fixed crash when accessing Forms Based Authentication.
  • Fixed crash when uploading duplicate SERP URLs.
  • Fixed crashes introduced by update to macOS 10.14 Mojave.