Custom extraction


The custom extraction tab works alongside the custom extraction configuration. This feature allows you to scrape any data from the HTML of pages in a crawl and can be configured under ‘Config > Custom > Extraction’.

You can configure up to 100 extractors in the custom extraction configuration, each of which accepts an XPath, CSSPath or regex expression to scrape the required data. Extraction is performed only against URLs with an HTML content type.
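As a rough illustration of the idea (not the SEO Spider's own implementation), the sketch below applies a set of named regex extractors to crawled pages, skips anything that is not HTML, and stores each result under the extractor's name. The extractor names, patterns, page data and the `run_extractors` helper are all hypothetical, chosen only for this example.

```python
import re

# Hypothetical extractor definitions: name -> regex with one capture group.
EXTRACTORS = {
    "Author": r'<meta name="author" content="([^"]*)"',
    "SKU": r'<span class="sku">([^<]*)</span>',
}


def run_extractors(pages, extractors):
    """Return one row per HTML page, with one column per named extractor."""
    rows = []
    for page in pages:
        # Only HTML responses are processed; other content types are skipped.
        if not page["content_type"].lower().startswith("text/html"):
            continue
        row = {"Address": page["url"]}
        for name, pattern in extractors.items():
            match = re.search(pattern, page["html"])
            # An empty string marks an extractor that found nothing.
            row[name] = match.group(1) if match else ""
        rows.append(row)
    return rows


# Made-up sample crawl data for the sketch.
pages = [
    {
        "url": "https://example.com/a",
        "content_type": "text/html; charset=UTF-8",
        "html": '<meta name="author" content="Jane"><span class="sku">A-1</span>',
    },
    {
        "url": "https://example.com/logo.png",
        "content_type": "image/png",
        "html": "",
    },
]

rows = run_extractors(pages, EXTRACTORS)
for row in rows:
    print(row)
```

Keying each row by extractor name mirrors how the tab labels its columns: renaming an extractor changes the column heading, not the extracted data.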

The results appear within the custom extraction tab as outlined below.


Columns

This tab includes the following columns.

  • Address – The URI crawled.
  • Content – The content type of the URI.
  • Status Code – HTTP response code.
  • Status – The HTTP header response.
  • [Extractor Name] – Column heading names are dynamic, based upon the name provided to each extractor. Each extractor will have a separate named column, which will contain the data extracted against each URL.

Filters

This tab includes the following filters.

  • [Extractor Name] – Filters are dynamic and match the names of the extractors and their corresponding columns. Each filter shows the relevant extraction column against the URLs.
