Include
Configuration > Include
This feature allows you to control which URL paths the SEO Spider will crawl using partial regex matching. It narrows the crawl to the URLs that match the regex, which is particularly useful for larger sites or sites with less intuitive URL structures. Matching is performed on the encoded version of the URL.
The page that you start the crawl from must have an outbound link which matches the regex for this feature to work; otherwise the SEO Spider cannot crawl onwards. If no URL linked from the start page matches the regex, the SEO Spider will not crawl anything beyond the start page!
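The start-page requirement above can be sketched as a small breadth-first crawl over a hypothetical link graph. This is only an illustration, not the SEO Spider's actual implementation: the `crawl` function, the `site` dictionary and the example.com URLs are all assumptions, and Python's `re` module stands in for the tool's own regex engine.

```python
import re
from collections import deque


def crawl(start, links, include):
    """Breadth-first crawl that only follows links matching the include regex.

    `links` maps each URL to the URLs it links to (a hypothetical site graph).
    The start URL is always crawled, even if it does not match the regex.
    """
    pattern = re.compile(include)
    crawled = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        crawled.append(url)
        for target in links.get(url, []):
            # Partial match: the pattern may match anywhere in the URL.
            if target not in seen and pattern.search(target):
                seen.add(target)
                queue.append(target)
    return crawled


# Hypothetical site: the start page links only to /about/, which links to /news/.
site = {
    "http://www.example.com/": ["http://www.example.com/about/"],
    "http://www.example.com/about/": ["http://www.example.com/news/"],
}
only_start = crawl("http://www.example.com/", site, "/news/")

# Once the start page links to the news section directly, the crawl proceeds.
site["http://www.example.com/"].append("http://www.example.com/news/")
with_news = crawl("http://www.example.com/", site, "/news/")
```

In the first run only the start URL is crawled, because /about/ does not match the include and so its link to /news/ is never discovered. In the second run the start page links to /news/ directly, so that URL is crawled as well.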
- As an example, if you wanted to crawl pages from https://www.screamingfrog.co.uk which have ‘search’ in the URL string, you would simply enter the regex search in the ‘include’ feature. This would find the /search-engine-marketing/ and /search-engine-optimisation/ pages, as they both have ‘search’ in them.
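The ‘search’ example can be checked outside the tool with a minimal sketch like the one below. It assumes Python's `re.search` behaves like the SEO Spider's partial matching for a simple pattern such as this one; the `is_included` helper is hypothetical, and the /seo-spider/ URL is included purely as a non-matching illustration.

```python
import re


def is_included(url, include_patterns):
    """Return True if any include pattern matches somewhere in the URL."""
    return any(re.search(pattern, url) for pattern in include_patterns)


urls = [
    "https://www.screamingfrog.co.uk/search-engine-marketing/",
    "https://www.screamingfrog.co.uk/search-engine-optimisation/",
    "https://www.screamingfrog.co.uk/seo-spider/",  # illustrative non-match
]

# Keep only the URLs that contain 'search' anywhere in the URL string.
included = [url for url in urls if is_included(url, ["search"])]
```

Here `included` holds the two /search-engine-…/ URLs, while the /seo-spider/ URL is filtered out because ‘search’ does not appear in it.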
Check out our video guide on the include feature.
Troubleshooting
- Matching is performed on the URL encoded address. You can see what this is in the URL Info tab in the lower window pane, or in the respective column in the Internal tab.
- Matching is partial, so the regular expression only needs to match part of the URL rather than the whole of it.
- If only a single URL is crawled before the crawl stops, check the outbound links from that page. If you crawl http://www.example.com/ with an include of ‘/news/’ and only 1 URL is crawled, this will be because http://www.example.com/ does not have any links to the news section of the site.
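The first troubleshooting point, matching against the URL encoded address, is easy to trip over with non-ASCII characters. The sketch below assumes the encoded address is a standard UTF-8 percent-encoding of the URL, which may differ in detail from what the SEO Spider shows; the `encoded` helper and the example.com URL are hypothetical.

```python
import re
from urllib.parse import quote


def encoded(url):
    """Percent-encode non-ASCII characters, leaving URL delimiters untouched."""
    return quote(url, safe=":/?#[]@!$&'()*+,;=%")


url = "https://www.example.com/café/"  # hypothetical URL
address = encoded(url)  # 'https://www.example.com/caf%C3%A9/'

# A pattern written with the raw character does not match the encoded address...
raw_hit = re.search("café", address) is not None

# ...whereas a pattern written in its encoded form does.
encoded_hit = re.search("caf%C3%A9", address) is not None
```

If an include unexpectedly matches nothing, it is worth comparing the pattern against the encoded address shown in the tool rather than the URL as it appears in a browser.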