Near Duplicates

This issue covers pages whose content is a near duplicate of another page, based on the configured similarity threshold (90% by default). Similarity is calculated using the minhash algorithm.
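To illustrate how minhash can estimate the similarity of two pieces of text, here is a minimal sketch in Python. It is not the SEO Spider's implementation: the 3-word shingles, the 128 hash functions, the choice of BLAKE2b and all function names are assumptions made for the example.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(item, seed):
    """64-bit hash of item, salted with seed to simulate an independent hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "big")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=128):
    """Return the minimum hash value per seeded hash function (items must be non-empty)."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def is_near_duplicate(text_a, text_b, threshold=0.9):
    """Hypothetical helper: True if the estimated similarity meets the threshold."""
    set_a, set_b = shingles(text_a), shingles(text_b)
    if not set_a or not set_b:
        return False  # assumption: empty pages are never treated as near duplicates
    sig_a = minhash_signature(set_a)
    sig_b = minhash_signature(set_b)
    return estimated_similarity(sig_a, sig_b) >= threshold
```

The signatures are much smaller than the text they summarise, which is why this kind of approach suits comparing every page in a crawl against every other page.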

Near duplicate pages can cause cannibalisation issues and crawling and indexing inefficiencies, and might be a sign of low-quality page content.

How to Analyse in the SEO Spider

View URLs with this issue in the ‘Content’ tab and ‘Near Duplicates’ filter.

To populate this filter, the ‘Enable Near Duplicates’ configuration must be selected via ‘Config > Content > Duplicates’, and a post-crawl ‘Crawl Analysis’ must be performed. The near duplicate similarity threshold can also be adjusted under ‘Config > Content > Duplicates’.

The ‘Closest Similarity Match’ column displays the highest percentage of similarity to another page. The ‘No. Near Duplicates’ column displays the number of pages that are similar to the page based upon the similarity threshold.
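As a rough illustration of how these two columns relate to pairwise similarity scores, the sketch below derives both values from a small, made-up set of scores. The function name, the data layout and the inclusive `>=` comparison against the threshold are assumptions for the example, not a description of the SEO Spider's internals.

```python
def summarise_near_duplicates(similarities, threshold=90.0):
    """Summarise pairwise scores per URL.

    similarities maps (url_a, url_b) tuples to a similarity percentage.
    Returns, per URL, the highest score against any other page and the
    number of pages at or above the threshold.
    """
    summary = {}
    for (url_a, url_b), score in similarities.items():
        for url in (url_a, url_b):
            entry = summary.setdefault(
                url, {"closest_match": 0.0, "near_duplicates": 0}
            )
            entry["closest_match"] = max(entry["closest_match"], score)
            if score >= threshold:  # assumption: threshold is inclusive
                entry["near_duplicates"] += 1
    return summary


# Illustrative scores only
scores = {("/a", "/b"): 96.0, ("/a", "/c"): 91.5, ("/b", "/c"): 72.0}
for url, row in sorted(summarise_near_duplicates(scores).items()):
    print(url, row["closest_match"], row["near_duplicates"])
```

In this example ‘/a’ has two near duplicates, while ‘/b’ and ‘/c’ have one each, because the 72% match between ‘/b’ and ‘/c’ falls below the 90% threshold.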

The algorithm is run against the text on the page, rather than the full HTML used for exact duplicates. The content used for this analysis can be configured under ‘Config > Content > Area’.

Pages can have 100% similarity, but still be reported as a ‘near duplicate’ rather than an exact duplicate. This is because exact duplicates are excluded from the near duplicates, to avoid them being flagged twice.

Similarity scores are also rounded, so 99.5% or higher will be displayed as 100%.
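The rounding described above can be sketched as follows, assuming scores are rounded half up to the nearest whole percent (the function name is hypothetical). Python's built-in `round()` rounds halves to the nearest even number, so the sketch uses `math.floor` instead.

```python
import math


def displayed_similarity(score_percent):
    """Round a similarity percentage half up to a whole number for display."""
    return math.floor(score_percent + 0.5)


print(displayed_similarity(99.5))   # 100
print(displayed_similarity(99.49))  # 99
```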

View near duplicates using the lower ‘Duplicate Details’ tab. Export in bulk using ‘Bulk Export > Content > Near Duplicates’.

Read our tutorial on ‘How To Check For Duplicate Content’.

What Triggers This Issue

This issue is triggered when pages are near duplicates in content, based on the configured similarity threshold (90% by default).

How To Fix

Having very similar pages can cause cannibalisation issues and crawling and indexing inefficiencies.

Very similar pages should be minimised. High similarity could be a sign of low-quality pages which haven’t received much love, or which just shouldn’t be separate pages in the first place.

Analyse the near duplicates, considering the importance of the pages and the scale of the issue. Then improve the content to make it more distinct where necessary, or consolidate, block, remove, or leave the pages as they are where appropriate.
