Custom extraction
The custom extraction tab works alongside the custom extraction configuration. This feature allows you to scrape any data from the HTML of pages in a crawl and can be configured under ‘Config > Custom > Extraction’.
You can configure up to 100 extractors in the custom extraction configuration. Each extractor takes an XPath, CSSPath or regex expression that scrapes the required data. Extraction is performed against URLs with an HTML content type only.
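To illustrate the kind of expression an extractor takes, here is a minimal sketch in Python using only the standard library. It does not use the SEO Spider itself; the sample page, the `extract_xpath` and `extract_regex` helpers and the expressions are all hypothetical. Note that `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup, so real-world HTML may behave differently.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption for illustration only).
SAMPLE_HTML = """<html>
  <head><title>Example product</title></head>
  <body>
    <h1>Blue widget</h1>
    <span class="price">19.99</span>
    <span class="sku">SKU-12345</span>
  </body>
</html>"""


def extract_xpath(html, path):
    """Return the stripped text of every element matched by a limited XPath."""
    root = ET.fromstring(html)
    return [(el.text or "").strip() for el in root.findall(path)]


def extract_regex(html, pattern):
    """Return every non-overlapping match of the pattern in the raw HTML."""
    return re.findall(pattern, html)


# An XPath-style extractor: the text of the span with class "price".
price = extract_xpath(SAMPLE_HTML, ".//span[@class='price']")

# A regex-style extractor: anything in the raw source that looks like a SKU.
sku = extract_regex(SAMPLE_HTML, r"SKU-\d+")

print(price, sku)
```

The XPath approach targets an element by its position and attributes in the document, while the regex approach matches against the raw source text, which can be useful for data that sits outside of clean element boundaries.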
The results appear within the custom extraction tab as outlined below.
Columns
This tab includes the following columns.
- Address – The URI crawled.
- Content – The content type of the URI.
- Status Code – HTTP response code.
- Status – The HTTP header response.
- [Extractor Name] – Column heading names are dynamic, based upon the name given to each extractor. Each extractor has a separate named column, which contains the data extracted against each URL.
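The column layout described above can be sketched as a table with four fixed columns plus one column per extractor name. This is only an illustration in Python: the `build_table` helper, the extractor names "Price" and "SKU" and the sample row are assumptions, and the output is not claimed to match the SEO Spider's actual export format.

```python
import csv
import io

# Fixed columns listed for the custom extraction tab.
FIXED_COLUMNS = ["Address", "Content", "Status Code", "Status"]


def build_table(extractor_names, rows):
    """Build a CSV table whose header is the fixed columns plus one
    column per extractor name (hypothetical helper)."""
    header = FIXED_COLUMNS + list(extractor_names)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)  # columns missing from a row are left empty
    return header, buf.getvalue()


# Two hypothetical extractors named "Price" and "SKU".
header, table = build_table(
    ["Price", "SKU"],
    [
        {
            "Address": "https://example.com/widget",
            "Content": "text/html; charset=UTF-8",
            "Status Code": 200,
            "Status": "OK",
            "Price": "19.99",
            "SKU": "SKU-12345",
        }
    ],
)
print(table)
```

Renaming an extractor in this sketch changes the corresponding column heading, which mirrors the dynamic naming described for the tab.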
Filters
This tab includes the following filters.
- [Extractor Name] – Filters are dynamic and match the names of the extractors and their columns. Each filter shows the relevant extraction column against the URLs.