Custom extraction
Posted 13 November, 2019 by Dan Sharp
The custom extraction tab works alongside the custom extraction configuration. This feature allows you to scrape any data from the HTML of pages in a crawl and can be configured under ‘Config > Custom > Extraction’.
You can configure up to 100 extractors in the custom extraction configuration. Each extractor accepts an XPath, CSSPath or regex expression to scrape the required data. Extraction is performed against URLs with an HTML content type only.
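To illustrate what an XPath or regex extractor does conceptually, here is a minimal Python sketch using only the standard library. It is not how the SEO Spider itself is implemented. The sample page, the extractor names ("Price", "Analytics ID") and the helper functions are all hypothetical. The XPath support in `xml.etree.ElementTree` is a limited subset and needs well-formed markup, and CSSPath is omitted because the standard library has no CSS selector engine.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page. Real-world HTML is often not
# well-formed and would need a more tolerant parser than ElementTree.
PAGE = """<html><head><title>Blue Widget</title></head>
<body>
  <h1>Blue Widget</h1>
  <span class="price">19.99</span>
  <script>ga('create', 'UA-12345-1', 'auto');</script>
</body></html>"""


def extract_xpath(html, path):
    """Return the stripped text of every element matching a limited XPath."""
    root = ET.fromstring(html)
    return [(el.text or "").strip() for el in root.findall(path)]


def extract_regex(html, pattern):
    """Return the first capture group of every regex match in the raw HTML."""
    return re.findall(pattern, html)


# Illustrative extractor names mapped to (kind, expression) pairs.
extractors = {
    "Price": ("xpath", ".//span[@class='price']"),
    "Analytics ID": ("regex", r"'(UA-\d+-\d+)'"),
}


def run_extractors(html, extractors):
    """Apply every extractor to one page and collect the results by name."""
    row = {}
    for name, (kind, expr) in extractors.items():
        func = extract_xpath if kind == "xpath" else extract_regex
        row[name] = func(html, expr)
    return row


row = run_extractors(PAGE, extractors)
print(row)
```

Each key in `row` corresponds to one named extractor, which mirrors the way each extractor gets its own named column in the tab.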
The results appear within the custom extraction tab as outlined below.
Columns
This tab includes the following columns.
- Address – The URI crawled.
- Content – The content type of the URI.
- Status Code – HTTP response code.
- Status – The HTTP header response.
- [Extractor Name] – Column heading names are dynamic, based upon the name provided to each extractor. Each extractor will have a separate named column, which will contain the data extracted against each URL.
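The column layout described above can be sketched as a table whose header is the four fixed columns followed by one column per extractor name. The following Python example is illustrative only. The URLs, the extractor names ("Price", "SKU") and the row values are made up, and the CSV output is an assumption for demonstration, not the tool's actual export format.

```python
import csv
import io

# Fixed columns of the tab, followed by one dynamic column per extractor.
FIXED_COLUMNS = ["Address", "Content", "Status Code", "Status"]


def build_header(extractor_names):
    """Combine the fixed columns with the dynamic extractor columns."""
    return FIXED_COLUMNS + list(extractor_names)


def to_csv(rows, extractor_names):
    """Render rows as CSV, leaving a blank cell where nothing was extracted."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=build_header(extractor_names),
        restval="",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical crawl results: the second URL has no extracted data.
rows = [
    {
        "Address": "https://example.com/widget",
        "Content": "text/html; charset=UTF-8",
        "Status Code": 200,
        "Status": "OK",
        "Price": "19.99",
        "SKU": "BW-001",
    },
    {
        "Address": "https://example.com/about",
        "Content": "text/html; charset=UTF-8",
        "Status Code": 200,
        "Status": "OK",
    },
]

csv_text = to_csv(rows, ["Price", "SKU"])
print(csv_text)
```

Renaming an extractor in this sketch only changes the corresponding header cell, which is the sense in which the column headings are dynamic.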
Filters
This tab includes the following filters.
- [Extractor Name] – Filters are dynamic and match the name of each extractor and its column. Each filter shows the relevant extraction column against the URLs.
Please read our tutorial on ‘Web Scraping & Custom Extraction’.