The Internal tab combines all data extracted from most other tabs, except the external, hreflang and structured data tabs. This means all data can be viewed comprehensively, and exported together for further analysis.
URLs classed as ‘Internal’ are on the same subdomain as the start page of the crawl. Other URLs can be treated as internal by using the ‘crawl all subdomains’ configuration, list mode, or the CDNs feature.
This tab includes the following columns.
- Address – The URL address.
- Content – The content type of the URL.
- Status Code – The HTTP response code.
- Status – The HTTP header response.
- Indexability – Whether the URL is Indexable or Non-Indexable.
- Indexability Status – The reason why a URL is Non-Indexable. For example, if it’s canonicalised to another URL.
- Title 1 – The (first) page title discovered on the page.
- Title 1 Length – The character length of the page title.
- Title 1 Pixel Width – The pixel width of the page title as described in our pixel width post.
- Meta Description 1 – The (first) meta description on the page.
- Meta Description Length 1 – The character length of the meta description.
- Meta Description Pixel Width – The pixel width of the meta description.
- Meta Keyword 1 – The meta keywords.
- Meta Keywords Length – The character length of the meta keywords.
- h1 – 1 – The first h1 (heading) on the page.
- h1 – Len-1 – The character length of the h1.
- h2 – 1 – The first h2 (heading) on the page.
- h2 – Len-1 – The character length of the h2.
- Meta Robots 1 – Meta robots directives found on the URL.
- X-Robots-Tag 1 – X-Robots-tag HTTP header directives for the URL.
- Meta Refresh 1 – Meta refresh data.
- Canonical Link Element – The canonical link element data.
- rel=“next” 1 – The SEO Spider collects these HTML link elements designed to indicate the relationship between URLs in a paginated series.
- rel=“prev” 1 – The SEO Spider collects these HTML link elements designed to indicate the relationship between URLs in a paginated series.
- HTTP rel=“next” 1 – The SEO Spider collects these HTTP link elements designed to indicate the relationship between URLs in a paginated series.
- HTTP rel=“prev” 1 – The SEO Spider collects these HTTP link elements designed to indicate the relationship between URLs in a paginated series.
- Size – The size, reported in kilobytes. The value is taken from the Content-Length header if provided; otherwise it is set to zero. For HTML pages this is updated to the size of the (uncompressed) HTML in KB. On export, size is in bytes, so divide by 1,024 to convert to kilobytes.
- Word Count – This is all ‘words’ inside the body tag, excluding HTML markup. The count is based upon the content area, which can be adjusted under ‘Config > Content > Area’. By default, the nav and footer elements are excluded. You can include or exclude HTML elements, classes and IDs to calculate a refined word count. Our figures may not exactly match a manual count, as the parser performs certain fix-ups on invalid HTML. Your rendering settings also affect what HTML is considered. A word is defined by taking the text and splitting it by spaces. No consideration is given to visibility of content (such as text inside a div set to hidden).
- Text Ratio – Number of non-HTML characters found in the HTML body tag on a page (the text), divided by the total number of characters the HTML page is made up of, and displayed as a percentage.
- Crawl Depth – Depth of the page from the start page (number of ‘clicks’ away from the start page). Please note, redirects are currently counted as a level in our crawl depth calculations.
- Link Score – A metric from 0 to 100 that estimates the relative value of a page based upon its internal links, similar to Google’s own PageRank. For this column to populate, ‘crawl analysis’ is required.
- Inlinks – Number of internal hyperlinks to the URL. ‘Internal inlinks’ are links in anchor elements pointing to a given URL from the same subdomain that is being crawled.
- Unique Inlinks – Number of ‘unique’ internal inlinks to the URL. ‘Internal inlinks’ are links in anchor elements pointing to a given URL from the same subdomain that is being crawled. For example, if ‘page A’ links to ‘page B’ 3 times, this would be counted as 3 inlinks and 1 unique inlink to ‘page B’.
- % of Total – Percentage of unique internal inlinks (200 response HTML pages) to the URL. ‘Internal inlinks’ are links in anchor elements pointing to a given URL from the same subdomain that is being crawled.
- Outlinks – Number of internal outlinks from the URL. ‘Internal outlinks’ are links in anchor elements from a given URL to other URLs on the same subdomain that is being crawled.
- Unique Outlinks – Number of unique internal outlinks from the URL. ‘Internal outlinks’ are links in anchor elements from a given URL to other URLs on the same subdomain that is being crawled. For example, if ‘page A’ links to ‘page B’ on the same subdomain 3 times, this would be counted as 3 outlinks and 1 unique outlink to ‘page B’.
- External Outlinks – Number of external outlinks from the URL. ‘External outlinks’ are links in anchor elements from a given URL to another subdomain.
- Unique External Outlinks – Number of unique external outlinks from the URL. ‘External outlinks’ are links in anchor elements from a given URL to another subdomain. For example, if ‘page A’ links to ‘page B’ on a different subdomain 3 times, this would be counted as 3 external outlinks and 1 unique external outlink to ‘page B’.
- Closest Similarity Match – This shows the highest similarity percentage of a near duplicate URL. The SEO Spider will identify near duplicates with a 90% similarity match, which can be adjusted to find content with a lower similarity threshold. For example, if there were two near duplicate pages for a page with 99% and 90% similarity respectively, then 99% will be displayed here. To populate this column, the ‘Enable Near Duplicates’ configuration must be selected via ‘Config > Content > Duplicates’, and a post-crawl ‘Crawl Analysis’ must be performed. Only URLs with content at or above the selected similarity threshold will contain data; the others will remain blank. By default, therefore, this column will only contain data for URLs with 90% or higher similarity, unless the ‘Near Duplicate Similarity Threshold’ setting has been adjusted via ‘Config > Content > Duplicates’.
- No. Near Duplicates – The number of near duplicate URLs discovered in a crawl that meet or exceed the ‘Near Duplicate Similarity Threshold’, which is a 90% match by default. This setting can be adjusted under ‘Config > Content > Duplicates’. To populate this column, the ‘Enable Near Duplicates’ configuration must be selected via ‘Config > Content > Duplicates’, and a post-crawl ‘Crawl Analysis’ must be performed.
- Spelling Errors – The total number of spelling errors discovered for a URL. For this column to be populated, ‘Enable Spell Check’ must be selected via ‘Config > Content > Spelling & Grammar’.
- Grammar Errors – The total number of grammar errors discovered for a URL. For this column to be populated, ‘Enable Grammar Check’ must be selected via ‘Config > Content > Spelling & Grammar’.
- Language – The language selected for spelling and grammar checks. This is based upon the HTML language attribute, but the language can also be set via ‘Config > Content > Spelling & Grammar’.
- Hash – Hash value of the page using the MD5 algorithm. This is a check for exact duplicate content only. If two hash values match, the pages are exactly the same in content. If there’s a single character difference, they will have different hash values and not be detected as duplicate content, so this is not a check for near duplicate content. The exact duplicates can be seen under ‘URL > Duplicate’.
- Response Time – Time in seconds to download the URL. More detailed information can be found in our FAQ.
- Last-Modified – Read from the Last-Modified header in the server’s HTTP response. If the server does not provide this, the value will be empty.
- Redirect URI – If the ‘address’ URL redirects, this column will include the redirect URL target. The status code above will display the type of redirect, 301, 302 etc.
- HTTP Version – This shows the HTTP version the crawl was performed under, which is currently HTTP/1.1 only. The SEO Spider currently only crawls using HTTP/1.1, but this column was released in preparation for supporting HTTP/2 in a future update.
- URL Encoded Address – The URL actually requested by the SEO Spider. All non-ASCII characters are percent-encoded; see RFC 3986 for further details.
- Title 2, meta description 2, h1-2, h2-2 etc – The SEO Spider will collect data from the first two elements it encounters in the source code. Hence, h1-2 is data from the second h1 heading on the page.
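The Word Count definition above (text inside the body tag, nav and footer excluded by default, split by spaces) can be approximated with a short sketch. This is an illustration only, not the SEO Spider's parser: the class and function names are hypothetical, it uses Python's standard `html.parser`, it performs none of the fix-ups on invalid HTML mentioned above, and skipping `script` and `style` content is an assumption made here so that code is not counted as words.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Collect text inside <body>, skipping excluded elements."""

    def __init__(self, excluded=("nav", "footer")):
        super().__init__()
        # Assumption: script/style text is never counted as words.
        self.skipped = set(excluded) | {"script", "style"}
        self.in_body = False
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.skipped:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.skipped and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)


def word_count(html, excluded=("nav", "footer")):
    """Count whitespace-separated words in the body text."""
    parser = BodyTextParser(excluded)
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split())


sample = (
    "<html><head><title>T</title></head><body>"
    "<nav>Home About</nav><p>Hello big world</p>"
    "<footer>Copyright</footer></body></html>"
)
print(word_count(sample))               # 3
print(word_count(sample, excluded=()))  # 6
```

Passing a different `excluded` tuple mimics adjusting the content area; a real count will also depend on rendering settings, which this sketch ignores.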
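The difference between Inlinks and Unique Inlinks described above can be made concrete with a small counting sketch. It assumes a crawl has already been reduced to a list of (source, target) pairs, one per anchor element; the function name and the example URLs are hypothetical.

```python
from collections import Counter


def count_inlinks(links):
    """Return (inlinks, unique_inlinks) counters keyed by target URL.

    `links` is an iterable of (source_url, target_url) pairs,
    one pair per anchor element.
    """
    links = list(links)
    # Every anchor counts towards inlinks.
    inlinks = Counter(target for _, target in links)
    # Repeated links from the same source count once towards unique inlinks.
    unique_inlinks = Counter(target for _, target in set(links))
    return inlinks, unique_inlinks


# 'page A' links to 'page B' three times; 'page C' links to it once.
links = [("/page-a", "/page-b")] * 3 + [("/page-c", "/page-b")]
inlinks, unique_inlinks = count_inlinks(links)
print(inlinks["/page-b"], unique_inlinks["/page-b"])  # 4 2
```

The same approach applies to Outlinks and Unique Outlinks by keying the counters on the source URL instead of the target.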
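The Hash column's behaviour (identical content gives identical MD5 values, while a single character difference gives a different value) can be sketched as follows. Exactly which bytes the SEO Spider hashes is not stated above, so hashing the UTF-8 encoded HTML is an assumption here, and the function names and sample pages are hypothetical.

```python
import hashlib
from collections import defaultdict


def page_hash(html: str) -> str:
    """MD5 hex digest of the page content (assumed: UTF-8 encoded HTML)."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


def exact_duplicates(pages: dict) -> list:
    """Group URLs whose content hashes are identical."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[page_hash(html)].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]


pages = {
    "/a": "<p>Hello</p>",
    "/b": "<p>Hello</p>",   # exact copy of /a
    "/c": "<p>Hello!</p>",  # one character different, so a different hash
}
print(exact_duplicates(pages))  # [['/a', '/b']]
```

Because any single-character change alters the digest, this only finds exact duplicates; near duplicates need a similarity measure such as the one behind the Closest Similarity Match column.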
This tab includes the following filters.
- HTML – HTML pages.
- CSS – Any style sheets discovered.
- Images – Any images.
- PDF – Any portable document files.
- Flash – Any .swf files.
- Other – Any other file types, like docs etc.
- Unknown – Any URLs with an unknown content type, either because it has not been supplied, is incorrect, or because the URL can’t be crawled. URLs blocked by robots.txt will also appear here, as their file type is unknown.