How To Audit Canonicals Using The SEO Spider
The rel="canonical" element helps specify a single preferred version of a page when it’s available via multiple URLs. It’s a hint to the search engines to help prevent duplicate content, by consolidating indexing and link properties to a single URL to use in ranking.
This tutorial walks you through how you can use the Screaming Frog SEO Spider to audit canonical implementation quickly and efficiently across a website. The SEO Spider will crawl canonical link elements found in the HTML and HTTP Headers and report upon their set-up and common errors.
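For reference, the two places a canonical can be declared (a link element in the HTML, or a Link HTTP header) can be sketched in a few lines of Python. This is only an illustrative sketch, not how the SEO Spider itself works: the function names and the example.com URLs are hypothetical, and the Link header parsing is deliberately simplified.

```python
import re
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Collects the href of every <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated values, so split before checking
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr_map.get("href"):
            self.canonicals.append(attr_map["href"])


def html_canonicals(html):
    """Return canonical URLs declared via link elements in an HTML string."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonicals


def header_canonicals(link_header):
    """Return canonical URLs declared in an HTTP Link header value.

    Simplified: assumes each entry looks like <URL>; rel="canonical".
    """
    found = []
    for url, params in re.findall(r"<([^>]*)>([^,]*)", link_header or ""):
        if re.search(r'\brel\s*=\s*"?[^";]*\bcanonical\b', params, re.IGNORECASE):
            found.append(url)
    return found


# Hypothetical page and header, for illustration only
page = '<head><link rel="canonical" href="https://example.com/page"></head>'
print(html_canonicals(page))
print(header_canonicals('<https://example.com/page>; rel="canonical"'))
```

A page can declare a canonical in either place, or both, which is why the SEO Spider reports them in separate columns.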
To get started, you’ll need to download the SEO Spider, which is free in lite form for crawling up to 500 URLs. Next up, simply follow these steps.
1) Ensure ‘Crawl Canonicals’ is Enabled under ‘Configuration > Spider’
This option is enabled by default, so unless you have adjusted the configuration this will already be set-up. The SEO Spider ‘Configuration’ is available in the top level menu.
This will mean URLs referenced in rel="canonical" will be crawled, as well as extracted and reported. Next, click ‘OK’.
2) Crawl The Website
Open up the SEO Spider, type or copy in the website you wish to crawl in the ‘Enter URL to spider’ box and hit ‘Start’.
The website and any URLs within rel="canonical" elements will be crawled.
Now grab a coffee and wait until the progress bar reaches 100%, and the crawl is completed.
3) View The Canonicals Tab
The Canonicals tab shows all URLs found in a crawl and their corresponding rel="canonical" link elements and HTTP Canonicals in separate corresponding columns in the main window pane.
The Canonicals tab has 6 filters that help you understand your canonical implementation, and identify common canonical problems.
The ‘Occurrences’ column counts the number of rel="canonical" elements that have been discovered for each URL.
The right hand overview window pane provides a summary of data contained within each tab and filter, so you know where to click, without having to check each filter to see if there’s data. In this example crawl, there’s 1 URL that’s ‘canonicalised’ and 1 URL that has a ‘Non-Indexable Canonical’.
You’re able to filter by the following –
- Contains Canonical – The page has a canonical URL set (either via link element, HTTP header or both). This could be a self-referencing canonical URL where the page URL is the same as the canonical URL, or it could be ‘canonicalised’, where the canonical URL is different to the page URL.
- Self Referencing – The URL has a canonical which is the same URL as the page URL crawled (hence, it’s self referencing). Ideally only canonical versions of URLs would be linked to internally, and every URL would have a self-referencing canonical to help avoid any potential duplicate content issues that can occur (even naturally on the web, such as tracking parameters on URLs, other websites incorrectly linking to a URL that resolves etc).
- Canonicalised – The page has a canonical URL that is different to itself. The URL is ‘canonicalised’ to another location. This is a hint to the search engines not to index the page, and to consolidate the indexing and linking properties to the target canonical URL. These URLs should be reviewed carefully. In a perfect world, a website wouldn’t need to canonicalise any URLs as only canonical versions would be linked to, but often they are required due to circumstances outside of your control, and to prevent duplicate content.
- Missing – There’s no canonical URL present either as a link element, or via HTTP header. If a page doesn’t indicate a canonical URL, Google will identify what they think is the best version or URL. This can lead to ranking unpredictability, and hence generally all URLs should specify a canonical version.
- Multiple – There are multiple canonicals set for a URL (either multiple link elements, HTTP header, or both combined). This can lead to unpredictability, as there should only be a single canonical URL set by a single implementation (link element, or HTTP header) for a page.
- Non-Indexable Canonical – The canonical URL is a non-indexable page. This will include canonicals which are blocked by robots.txt, no response, redirect (3XX), client error (4XX), server error (5XX) or are ‘noindex’. Canonical versions of URLs should always be indexable, ‘200’ response pages. Therefore, canonicals that go to non-indexable pages should be corrected to the resolving indexable versions.
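The filter definitions above can be expressed as a small classification function. The sketch below is an assumption-laden illustration of those definitions and not the SEO Spider’s internal logic: the function name, the exact string comparison of URLs and the ‘non_indexable’ set are all hypothetical.

```python
def canonical_filters(page_url, link_canonicals, header_canonicals,
                      non_indexable=frozenset()):
    """Return the set of Canonicals tab filters a URL would fall under.

    link_canonicals / header_canonicals: canonical URLs found in the HTML
    and in the HTTP header. non_indexable: canonical targets already known
    to be non-indexable (an assumed input, e.g. from a previous check).
    """
    canonicals = list(link_canonicals) + list(header_canonicals)
    if not canonicals:
        return {"Missing"}

    filters = {"Contains Canonical"}
    if len(canonicals) > 1:
        filters.add("Multiple")
    for canonical in canonicals:
        # Assumption: URLs are compared as exact strings, with no normalisation
        if canonical == page_url:
            filters.add("Self Referencing")
        else:
            filters.add("Canonicalised")
        if canonical in non_indexable:
            filters.add("Non-Indexable Canonical")
    return filters


# Hypothetical URLs, for illustration only
print(canonical_filters("https://example.com/a", ["https://example.com/a"], []))
print(canonical_filters("https://example.com/a?ref=1", ["https://example.com/a"], []))
```

A URL can sit under more than one filter at once, for example ‘Contains Canonical’ together with ‘Canonicalised’ and ‘Non-Indexable Canonical’.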
4) View Non-Indexable Canonical URLs ‘Indexability Status’ Via The Lower Window Pane ‘URL Info’ Tab
The ‘URL Info’ tab at the bottom displays the reason why a canonical is non-indexable. As per the example below, this canonical URL is non-indexable because it’s redirected.
The canonical URL is ‘https://www.thelightingsuperstore.co.uk/clearance-lighting/clearance-stock-light-fittings’, which redirects. Hence, this is considered as ‘non-indexable’.
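As a rough sketch of the reasoning behind an indexability status, the function below maps what is known about a canonical target (status code, robots.txt blocking, ‘noindex’) to a reason. The function name and the returned labels are hypothetical and chosen for illustration; they are not guaranteed to match the wording shown in the ‘URL Info’ tab.

```python
def canonical_target_status(status_code, blocked_by_robots=False, noindex=False):
    """Return a reason a canonical target is non-indexable, or 'Indexable'.

    status_code: HTTP status as an int, or None if there was no response.
    The labels returned here are illustrative assumptions.
    """
    if blocked_by_robots:
        return "Blocked by robots.txt"
    if status_code is None:
        return "No Response"
    if 300 <= status_code < 400:
        return "Redirected"
    if 400 <= status_code < 500:
        return "Client Error"
    if 500 <= status_code < 600:
        return "Server Error"
    if noindex:
        return "noindex"
    if status_code == 200:
        return "Indexable"
    return "Other"


# A canonical that 301 redirects, as in the example above
print(canonical_target_status(301))
```

Only a ‘200’ response that is not blocked and not ‘noindex’ comes back as indexable, which matches the rule that canonical versions of URLs should always be indexable.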
5) Use The ‘Reports > Non-Indexable Canonicals’ Export To Bulk Export Source URLs, Non-Indexable Canonical URLs & Their Status Codes
To bulk export details of source pages that contain non-indexable canonicals, their respective indexability, indexability status, status and status code, click ‘Reports’ in the top level menu and choose ‘Non-Indexable Canonicals’.
This export is often much easier to digest and work through to fix (or to send on to a developer to fix). It also includes details of any canonical URLs that are ‘unlinked’ in the crawl, meaning they are not linked to via normal HTML anchor elements.
6) Click ‘Reports > Redirect & Canonical Chains’ Report To View Chained Canonicals & Loops
Similar to redirects, canonicals can also be chained and have loops. A page URL can be canonicalised to another URL, which is canonicalised to another URL and so on. Often, a chain is a combination of both canonicals and redirects together.
Once this report has been exported, filter the ‘Chain Type’ column for ‘Canonical’ or ‘Mixed’ to view canonical chains. In this example, there is a ‘mixed’ redirect loop, due to the non-indexable canonical URL.
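If you would rather filter the export with a script than in a spreadsheet, the same ‘Chain Type’ filter can be sketched with Python’s csv module. This assumes the report has been saved as a CSV whose header row contains a ‘Chain Type’ column; the function name, the sample rows, the ‘HTTP Redirect’ value and the other column headings in the sample are hypothetical.

```python
import csv
import io


def canonical_chain_rows(csv_file, chain_types=("canonical", "mixed")):
    """Return rows whose 'Chain Type' is one of chain_types (case-insensitive).

    csv_file: an open text file or any iterable of CSV lines.
    """
    reader = csv.DictReader(csv_file)
    return [
        row for row in reader
        if (row.get("Chain Type") or "").strip().lower() in chain_types
    ]


# Made-up sample data standing in for a real export
sample = io.StringIO(
    "Chain Type,Address,Final Address\n"
    "HTTP Redirect,https://example.com/old,https://example.com/new\n"
    "Canonical,https://example.com/a?ref=1,https://example.com/a\n"
    "Mixed,https://example.com/b,https://example.com/b\n"
)
for row in canonical_chain_rows(sample):
    print(row["Chain Type"], row["Address"])
```

With a real export you would pass an opened file instead of the sample, for example open(path, newline="", encoding="utf-8"), where the path is wherever you saved the report.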
In the exported spreadsheet, the fixed columns show the number of ‘redirects’ (which really means ‘hops’, as it can include canonicalised URLs), the start ‘address’ and the ‘final address’. In this example there are two ‘redirects’, and the final address indexability is ‘non-indexable’, as it’s ‘canonicalised’.
Scrolling to the right of the spreadsheet, each hop that has been discovered is shown. We can see the canonical URL redirects with a 301 status code back to the start URL (causing a loop).
To summarise the spreadsheet, the canonical chains export shows the https://www.thelightingsuperstore.co.uk/clearance-lighting page has a canonical URL set as https://www.thelightingsuperstore.co.uk/clearance-lighting/clearance-stock-light-fittings.
However, the https://www.thelightingsuperstore.co.uk/clearance-lighting/clearance-stock-light-fittings canonical URL actually 301 redirects back to the original https://www.thelightingsuperstore.co.uk/clearance-lighting parent page.
While this isn’t a big problem, it is a conflicting signal for the search engines and should be corrected. In other scenarios, canonical chains can be much longer and more complicated, and this report will help identify the error and the full path of the chain.
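To make the idea of a chain and a loop concrete, the example above can be modelled as a simple mapping of each URL to its next hop. This is a minimal sketch under stated assumptions, not the report’s actual algorithm: the function names, the ‘hops’ data structure, the 10 hop limit and the ‘Redirect’ label are hypothetical, while the two URLs are the ones from the example.

```python
def trace_chain(start, hops, max_hops=10):
    """Follow canonical/redirect hops from start.

    hops maps a URL to a (hop_type, target) tuple, where hop_type is
    'canonical' or 'redirect'. Returns (path, hop_types, is_loop).
    """
    path, hop_types, seen = [start], [], {start}
    current = start
    while current in hops and len(hop_types) < max_hops:
        hop_type, target = hops[current]
        hop_types.append(hop_type)
        path.append(target)
        if target in seen:  # back at a URL already visited: a loop
            return path, hop_types, True
        seen.add(target)
        current = target
    return path, hop_types, False


def chain_type(hop_types):
    """Label a chain 'Canonical', 'Redirect' or 'Mixed' (None if no hops)."""
    kinds = set(hop_types)
    if not kinds:
        return None
    if kinds == {"canonical"}:
        return "Canonical"
    if kinds == {"redirect"}:
        return "Redirect"
    return "Mixed"


parent = "https://www.thelightingsuperstore.co.uk/clearance-lighting"
child = parent + "/clearance-stock-light-fittings"
hops = {
    parent: ("canonical", child),  # parent page canonicalises to the child URL
    child: ("redirect", parent),   # child URL 301 redirects back to the parent
}
path, hop_types, is_loop = trace_chain(parent, hops)
print(len(hop_types), chain_type(hop_types), is_loop)
```

For the example data this gives two hops of mixed type that end back at the start URL, which is the loop described above.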
The guide above should help illustrate the simple steps required to audit rel="canonical" across a website using the SEO Spider.
If you have any further queries, then just get in touch via support.