The Sitemaps tab shows all URLs discovered in a crawl, which can then be filtered to show additional information related to XML Sitemaps.
To crawl XML Sitemaps in a regular crawl and for the filters to be populated, the ‘Crawl Linked XML Sitemaps’ configuration needs to be enabled (under ‘Configuration > Spider’).
A ‘Crawl Analysis’ will also need to be performed at the end of the crawl to populate some of the filters.
This tab includes the following columns.
- Address – The URL crawled.
- Content – The content type of the URL.
- Status Code – HTTP response code.
- Status – The HTTP header response.
- Indexability – Whether the URL is Indexable or Non-Indexable.
- Indexability Status – The reason why a URL is Non-Indexable. For example, if it’s canonicalised to another URL.
This tab includes the following filters.
- URLs In Sitemap – All URLs that are in an XML Sitemap. This should contain indexable and canonical versions of important URLs.
- URLs Not In Sitemap – URLs that are not in an XML Sitemap, but were discovered in the crawl. This might be on purpose (as they are not important), or they might be missing, in which case the XML Sitemap needs to be updated to include them. This filter does not consider non-indexable URLs. It assumes they are correctly non-indexable, and therefore shouldn’t be flagged to be included.
- Orphan URLs – URLs that are only in an XML Sitemap and were not discovered during the crawl, or URLs that were only discovered via URLs in the XML Sitemap and were not found in the crawl itself. These might be accidentally included in the XML Sitemap, or they might be pages that you wish to be indexed, and should really be linked to internally.
- Non-Indexable URLs in Sitemap – URLs that are in an XML Sitemap, but are non-indexable, and hence should be removed, or their indexability needs to be fixed.
- URLs In Multiple Sitemaps – URLs that are in more than one XML Sitemap. This isn’t necessarily a problem, but generally a URL only needs to be in a single XML Sitemap.
- XML Sitemap With Over 50k URLs – This shows any XML Sitemap that has more than the permitted 50k URLs. If you have more URLs, you will have to break your list into multiple sitemaps and create a sitemap index file which lists them all.
- XML Sitemap With Over 50mb – This shows any XML Sitemap that is larger than the permitted 50MB (uncompressed) file size. If a sitemap is over this limit, you will have to break your list into multiple sitemaps.
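The filters above can be thought of as set comparisons between the URLs found by crawling and the URLs listed in XML Sitemaps. The following is a minimal Python sketch of that idea, not the tool's actual implementation. The function names `classify_sitemap_urls` and `split_into_sitemaps`, and the input shapes, are hypothetical and chosen only for illustration. It assumes `crawled` holds URLs discovered through links rather than through a sitemap, and it does not model the second Orphan URLs case (URLs discovered only via other sitemap URLs).

```python
from typing import Dict, List, Set

# The 50k URL limit per XML Sitemap described in the filter list.
MAX_URLS_PER_SITEMAP = 50_000


def classify_sitemap_urls(
    crawled: Set[str],
    non_indexable: Set[str],
    sitemaps: Dict[str, List[str]],
) -> Dict[str, Set[str]]:
    """Approximate the Sitemaps tab filters with set operations.

    crawled:       URLs discovered by following links (assumption).
    non_indexable: URLs known to be non-indexable.
    sitemaps:      sitemap name -> URLs listed in that sitemap.
    """
    in_sitemap: Set[str] = set()
    in_multiple: Set[str] = set()
    for urls in sitemaps.values():
        # De-duplicate within one sitemap so repeats in the same file
        # do not count as "multiple sitemaps".
        for url in set(urls):
            if url in in_sitemap:
                in_multiple.add(url)
            in_sitemap.add(url)

    return {
        "URLs In Sitemap": in_sitemap,
        # Non-indexable URLs are deliberately excluded from this filter.
        "URLs Not In Sitemap": (crawled - in_sitemap) - non_indexable,
        "Orphan URLs": in_sitemap - crawled,
        "Non-Indexable URLs in Sitemap": in_sitemap & non_indexable,
        "URLs In Multiple Sitemaps": in_multiple,
    }


def split_into_sitemaps(
    urls: List[str], limit: int = MAX_URLS_PER_SITEMAP
) -> List[List[str]]:
    """Break a URL list into chunks that each respect the URL limit."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]
```

Each chunk returned by `split_into_sitemaps` would become its own XML Sitemap, with a sitemap index file listing them all. The sketch only checks the URL count; the 50MB uncompressed size limit would need a separate check on the generated files.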