Table of Contents
- Installation on Windows
- Installation on macOS
- Installation on Ubuntu
- Saving, opening, exporting & importing crawls
- User agent
- Checking memory allocation
- XML sitemap creation
- Crawl analysis
- Command line interface set-up
- Command line interface
- Search function
- User Interface
Spider Crawl Tab
Spider Extraction Tab
Spider Limits Tab
Spider Rendering Tab
Spider Advanced Tab
- Cookie storage
- Ignore non-indexable URLs for on-page filters
- Ignore paginated URLs for duplicate filters
- Always follow redirects
- Always follow canonicals
- Respect noindex
- Respect canonical
- Respect next/prev
- Respect HSTS policy
- Respect self referencing meta refresh
- Extract images from img srcset attribute
- Crawl fragment identifiers
- Response timeout
- 5XX response retries
Spider Preferences Tab
Other Configuration Options
- Content area
- Spelling & grammar
- Robots.txt settings
- Custom robots.txt
- URL rewriting
- User agent
- HTTP header
- Custom search
- Custom extraction
- Custom link positions
- User Interface
- Google Analytics integration
- Google Search Console integration
- PageSpeed Insights integration
- Memory allocation
- Storage mode
Lower Window Tabs
Right Side Window Tabs
The ‘Content’ tab shows data related to the content of internal HTML URLs discovered in a crawl.
This includes details on word count, duplicate and near duplicate content, and spelling and grammar errors.
This tab includes the following columns.
- Address – The URL address.
- Word Count – This is all ‘words’ inside the body tag, excluding HTML markup. The count is based upon the content area that can be adjusted under ‘Config > Content > Area’. By default, the nav and footer elements are excluded. You can include or exclude HTML elements, classes and IDs to calculate a refined word count. Our figures may not be exactly what performing this calculation manually would find, as the parser performs certain fix-ups on invalid HTML. Your rendering settings also affect what HTML is considered. Our definition of a word is taking the text and splitting it by spaces. No consideration is given to visibility of content (such as text inside a div set to hidden).
- Closest Similarity Match – This shows the highest similarity percentage of a near duplicate URL. The SEO Spider will identify near duplicates with a 90% similarity match, which can be adjusted to find content with a lower similarity threshold. For example, if there were two near duplicate pages for a page with 99% and 90% similarity respectively, then 99% will be displayed here. To populate this column the ‘Enable Near Duplicates’ configuration must be selected via ‘Config > Content > Duplicates‘, and post ‘Crawl Analysis‘ must be performed. Only URLs with content over the selected similarity threshold will contain data, the others will remain blank. Thus by default, this column will only contain data for URLs with 90% or higher similarity, unless it has been adjusted via the ‘Config > Content > Duplicates’ and ‘Near Duplicate Similarity Threshold’ setting.
- No. Near Duplicates – The number of near duplicate URLs discovered in a crawl that meet or exceed the ‘Near Duplicate Similarity Threshold’, which is a 90% match by default. This setting can be adjusted under ‘Config > Content > Duplicates’. To populate this column the ‘Enable Near Duplicates’ configuration must be selected via ‘Config > Content > Duplicates‘, and post ‘Crawl Analysis‘ must be performed.
- Total Language Errors – The total number of spelling and grammar errors discovered for a URL. For this column to be populated then either ‘Enable Spell Check’ or ‘Enable Grammar Check’ must be selected via ‘Config > Content > Spelling & Grammar‘.
- Spelling Errors – The total number of spelling errors discovered for a URL. For this column to be populated then ‘Enable Spell Check’ must be selected via ‘Config > Content > Spelling & Grammar‘.
- Grammar Errors – The total number of grammar errors discovered for a URL. For this column to be populated then ‘Enable Grammar Check’ must be selected via ‘Config > Content > Spelling & Grammar’.
- Language – The language selected for spelling and grammar checks. This is based upon the HTML language attribute, but the language can also be set via ‘Config > Content > Spelling & Grammar‘.
- Hash – Hash value of the page using the MD5 algorithm. This is a duplicate content check for exact duplicate content only. If two hash values match, the pages are exactly the same in content. If there’s a single character difference, they will have unique hash values and not be detected as duplicate content. So this is not a check for near duplicate content. The exact duplicates can be seen under ‘Content > Exact Duplicates’.
- Indexability – Whether the URL is Indexable or Non-Indexable.
- Indexability Status – The reason why a URL is Non-Indexable. For example, if it’s canonicalised to another URL.
This tab includes the following filters.
- Exact Duplicates – This filter will show pages that are identical to each other using the MD5 algorithm which calculates a ‘hash’ value for each page and can be seen in the ‘hash’ column. This check is performed against the full HTML of the page. It will show all pages with matching hash values that are exactly the same. Exact duplicate pages can lead to the splitting of PageRank signals and unpredictability in ranking. There should only be a single canonical version of a URL that exists and is linked to internally. Other versions should not be linked to, and they should be 301 redirected to the canonical version.
- Near Duplicates – This filter will show similar pages based upon the configured similarity threshold using the minhash algorithm. The threshold can be adjusted under ‘Config > Content > Duplicates’ and is set at 90% by default. The ‘Closest Similarity Match’ column displays the highest percentage of similarity to another page. The ‘No. Near Duplicates’ column displays the number of pages that are similar to the page based upon the similarity threshold. The algorithm is run against text on the page, rather than the full HTML like exact duplicates. The content used for this analysis can be configured under ‘Config > Content > Area’. Pages can have a 100% similarity, but only be a ‘near duplicate’ rather than exact duplicate. This is because exact duplicates are excluded as near duplicates, to avoid them being flagged twice. Similarity scores are also rounded, so 99.5% or higher will be displayed as 100%. To populate this column the ‘Enable Near Duplicates’ configuration must be selected via ‘Config > Content > Duplicates‘, and post ‘Crawl Analysis‘ must be performed.
- Low Content Pages – This will show any HTML pages with a word count below 200 words by default. The word count is based upon the content area settings used in the analysis which can be configured via ‘Config > Content > Area’. There isn’t a minimum word count for pages in reality, but the search engines do require descriptive text to understand the purpose of a page. This filter should only be used as a rough guide to help identify pages that might be improved by adding more descriptive content in the context of the website and page’s purpose. Some websites, such as ecommerce, will naturally have lower word counts, which can be acceptable if a products details can be communicated efficiently. The word count used for the low content pages filter can be adjusted via ‘Config > Spider > Preferences > Low Content Word Count‘ to your own preferences.
- Spelling Errors – This filter contains any HTML pages with spelling errors. For this filter and respective columns to be populated then ‘Enable Spell Check’ must be selected via ‘Config > Content > Spelling & Grammar‘.
- Grammar Errors – This filter contains any HTML pages with grammar errors. For this column to be populated then ‘Enable Grammar Check’ must be selected via ‘Config > Content > Spelling & Grammar‘.
For more information on analysing duplicate content, please read our tutorial on ‘How To Check For Duplicate Content‘.
Join the mailing list for updates, tips & giveawaysHow we use the data in this form
Back to top