Web Scraping & Data Extraction Using The SEO Spider Tool

This tutorial walks you through using the Screaming Frog SEO Spider’s custom extraction feature to scrape data from websites.

The custom extraction feature allows you to scrape any data from the HTML of a web page using CSS Path, XPath and regex. The extraction is performed on the static HTML returned from URLs crawled by the SEO Spider, which return a 200 ‘OK’ response. To jump to examples click one of the below links:

XPath Examples
Regex Examples

To get started, you’ll need to download and install the SEO Spider software and have a licence to access the custom extraction feature necessary for scraping.

When you have the SEO Spider open, the next steps to start extracting data are as follows –

1) Click ‘Configuration > Custom > Extraction’

This menu can be found in the top level menu of the SEO Spider.

[Image: custom extraction for web scraping]

This will open up the custom extraction configuration with 10 separate ‘extractors’, which are set to ‘inactive’ as default.

[Image: custom extraction all extractors]

2) Select CSS Path, XPath or Regex for Scraping

The Screaming Frog SEO Spider tool provides three methods for scraping data from websites:

  1. XPath – XPath is a query language for selecting nodes from an XML-like document, such as HTML. This option allows you to scrape data by using XPath selectors, including attributes.
  2. CSS Path – In CSS, selectors are patterns used to select elements, and they are often the quickest of the three methods available. This option allows you to scrape data by using CSS Path selectors. An optional attribute field is also available.
  3. Regex – A regular expression is a special string of text used for matching patterns in data. This is best for advanced uses, such as scraping HTML comments or inline JavaScript.

CSS Path or XPath are recommended for most common scenarios, and although both have their advantages, you can simply pick the option which you’re most comfortable using.

When using XPath or CSS Path to collect HTML, you can choose exactly what to extract using the drop down filters –

  • Extract HTML Element – The selected element and all of its inner HTML content.
  • Extract Inner HTML – The inner HTML content of the selected element. If the selected element contains other HTML elements, they will be included.
  • Extract Text – The text content of the selected element and the text content of any sub elements.
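
The difference between the three options can be sketched outside the SEO Spider. The example below is only an approximation using Python's standard library: the snippet, the `author` class and the helper names are hypothetical, and the SEO Spider's own output may differ in details such as whitespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a crawled page.
SNIPPET = '<div><p class="author">By <a href="/dan/">Dan</a></p></div>'


def html_element(el):
    """Roughly 'Extract HTML Element': the element plus its inner HTML.

    Note: ET.tostring also appends any tail text that follows the element;
    the sample element has none.
    """
    return ET.tostring(el, encoding="unicode")


def inner_html(el):
    """Roughly 'Extract Inner HTML': the content without the wrapping tag."""
    children = "".join(ET.tostring(child, encoding="unicode") for child in el)
    return (el.text or "") + children


def text_only(el):
    """Roughly 'Extract Text': text of the element and of any sub elements."""
    return "".join(el.itertext())


root = ET.fromstring(SNIPPET)
author = root.find(".//p[@class='author']")

print(html_element(author))  # the <p> tag and everything inside it
print(inner_html(author))    # the content of the <p>, nested <a> included
print(text_only(author))     # text only, with tags removed
```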

3) Input Your Syntax

Next up, you’ll need to input your syntax into the relevant extractor fields. A quick and easy way to find the relevant CSS Path or XPath of the data you wish to scrape is to open the web page in Chrome, ‘inspect element’ on the HTML line you wish to collect, then right click and copy the relevant selector path provided.

For example, you may wish to start scraping the ‘authors’ of blog posts, and the number of comments each has received. Let’s take the Screaming Frog website as the example.

Open up any blog post in Chrome, right click and ‘inspect element’ on the author’s name, which is located on every post. This will open up the ‘elements’ HTML window. Simply right click again on the relevant HTML line (with the author’s name), copy the relevant CSS path or XPath and paste it into the respective extractor field in the SEO Spider. If you use Firefox, then you can do the same there too.

[Image: CSS Path scraping author]

You can rename the ‘extractors’, which correspond to the column names in the SEO Spider. In this example, I’ve used CSS Path.

[Image: custom extraction of authors and comments]

The ticks next to each extractor confirm the syntax used is valid. A red cross means the syntax is invalid and needs adjusting.

When you’re happy, simply press the ‘OK’ button at the bottom. If you’d like to see more examples, then skip to the bottom of this guide.

Please note – This is not the most robust method for building CSS Selectors and XPath expressions. The expressions given using this method can be very specific to the exact position of the element in the code, and that position can change for two reasons. First, the inspected view shows the rendered version of the page (the DOM), whereas by default the SEO Spider looks at the HTML source. Second, HTML clean-up can occur when the SEO Spider processes a page with invalid mark-up.

These can also differ between browsers, e.g. for the above ‘author’ example the following CSS Selectors are given –

Chrome: body > div.main-blog.clearfix > div > div.main-blog--posts > div.main-blog--posts_single--inside_author.clearfix.drop > div.main-blog--posts_single--inside_author-details.col-13-16 > div.author-details--social > a
Firefox: .author-details--social > a:nth-child(1)

The expressions given by Firefox are generally more robust than those provided by Chrome. Even so, this should not be used as a complete replacement for understanding the various extraction options and being able to build these manually by examining the HTML source.

The w3schools guide on CSS Selectors and their XPath introduction are good resources for understanding the basics of these expressions.
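
To illustrate why position-based expressions are fragile, here is a minimal sketch using Python's standard library. The two page variants, the class names and the `extract` helper are made up for the example, and ElementTree only supports a subset of XPath, so treat it as a rough model rather than how the SEO Spider evaluates expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical page as first inspected.
ORIGINAL = (
    "<body>"
    "<div class='nav'>Menu</div>"
    "<div class='author'><a href='/dan/'>Dan</a></div>"
    "</body>"
)
# The same page after a banner div is added above the navigation.
CHANGED = (
    "<body>"
    "<div class='banner'>Sale</div>"
    "<div class='nav'>Menu</div>"
    "<div class='author'><a href='/dan/'>Dan</a></div>"
    "</body>"
)

POSITIONAL = "./div[2]/a"               # depends on the element's position
BY_CLASS = ".//div[@class='author']/a"  # depends only on a class name


def extract(page, path):
    """Return the text of the first match, or None when nothing matches."""
    match = ET.fromstring(page).find(path)
    return None if match is None else match.text


for label, page in (("original", ORIGINAL), ("changed", CHANGED)):
    print(label, extract(page, POSITIONAL), extract(page, BY_CLASS))
```

The positional path stops matching as soon as an element is inserted ahead of the target, while the class-based path keeps working.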

4) Crawl The Website

Next, input the website address into the URL field at the top and click ‘start’ to crawl the website, and commence scraping.

[Image: crawl the site to scrape it]

5) View Scraped Data Under The Custom Tab & ‘Extraction’ Filter

Scraped data starts appearing in real time during the crawl, under the ‘Custom’ tab and the ‘Extraction’ filter. It also appears in the ‘Internal’ tab, allowing you to export everything collected together into Excel.

In the example outlined above, we can see the author names and number of comments next to each blog post, which have been scraped.

[Image: custom extraction scraping of authors and comments]

When the progress bar reaches ‘100%’, the crawl has finished and you can choose to ‘export’ the data using the ‘export’ buttons.

If you already have a list of URLs you wish to extract data from, rather than crawl a website to collect the data, then you can upload them using list mode.

That’s it! Hopefully the above guide helps illustrate how to use the SEO Spider software for web scraping.

The possibilities are endless. This feature can be used to collect anything from plain text to Google Analytics IDs, schema, social meta tags (such as Open Graph Tags & Twitter Cards), mobile annotations and hreflang values, as well as product prices, discount rates, stock availability etc. I’ve covered some more examples, which are split by the method of extraction.

XPath Examples

SEOs love XPath, so I have put together a very quick list of elements you may wish to extract using XPath. The SEO Spider uses the XPath implementation from Java 8, which supports XPath version 1.0.


Headings

As default, the SEO Spider only collects h1s and h2s, but if you’d like to collect h3s, the XPath is –

//h3

The data extracted –

[Image: h3s scraped]

However, you may wish to collect just the first h3, particularly if there are many per page. The XPath is –

/descendant::h3[1]

To collect the first 10 h3’s on a page, the XPath would be –

/descendant::h3[position() >= 1 and position() <= 10]

To count the number of h3 tags on a page the expression needed is –

count(//h3)

In this case ‘Extract Inner HTML’ in the far right dropdown of the Custom Extraction Window must be changed to ‘Function Value’ for this expression to work correctly.
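
As a rough way to sanity-check what the h3 expressions should return, the sketch below mirrors them in Python. It assumes expressions along the lines of `//h3`, `/descendant::h3[1]` and `count(//h3)`, and a made-up page with twelve headings. Python's ElementTree has no `count()` or `position()` support, so those parts are emulated with ordinary list operations.

```python
import xml.etree.ElementTree as ET

# Hypothetical page containing twelve h3 headings.
PAGE = (
    "<html><body>"
    + "".join(f"<h3>Heading {n}</h3>" for n in range(1, 13))
    + "</body></html>"
)

root = ET.fromstring(PAGE)

all_h3 = [h.text for h in root.findall(".//h3")]  # every h3, like //h3
first_h3 = all_h3[0] if all_h3 else None          # like /descendant::h3[1]
first_ten = all_h3[:10]                           # like position() <= 10
h3_count = len(all_h3)                            # like count(//h3)

print(first_h3)
print(first_ten)
print(h3_count)
```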

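The length of any extracted string can also be calculated with XPath using the ‘Function Value’ option. In XPath 1.0, passing several nodes to a string function uses only the first node, so to calculate the length of the first h3 on the page the following expression is needed –

string-length(//h3)

Hreflang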

The following XPath, combined with Extract HTML Element, will collect the contents of all hreflang elements –

//*[@hreflang]

The above will collect the entire HTML element, with the link and hreflang value. The results –

[Image: hreflang extracted]

If you wanted just the hreflang values (like ‘en-GB’), you could specify the attribute using @hreflang.

//*[@hreflang]/@hreflang

The data extracted –

[Image: hreflang values extracted]

Hreflang analysis functionality is now built into the SEO Spider as standard, for more details please see Hreflang Extraction and Hreflang Tab.
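
For testing hreflang extraction on a snippet before crawling, a minimal Python sketch is shown here. It assumes an element-level expression such as `//*[@hreflang]` and reads the attribute in Python rather than with `/@hreflang`, because ElementTree cannot return attribute nodes directly. The URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical <head> with two hreflang annotations and one unrelated link.
HEAD = (
    "<head>"
    '<link rel="alternate" hreflang="en-GB" href="https://example.com/uk/" />'
    '<link rel="alternate" hreflang="de-DE" href="https://example.com/de/" />'
    '<link rel="stylesheet" href="/style.css" />'
    "</head>"
)

root = ET.fromstring(HEAD)

# Any element carrying an hreflang attribute.
elements = root.findall(".//*[@hreflang]")

# Just the attribute values, then a value-to-URL mapping.
values = [el.get("hreflang") for el in elements]
pairs = {el.get("hreflang"): el.get("href") for el in elements}

print(values)
print(pairs)
```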


Structured Data

You may wish to collect the types of various Schema on a page, so the set-up might be –

//*[@itemtype]/@itemtype

The data extracted –

[Image: schema extracted]

For ‘itemprop’ rules, you can use a similar XPath –

//*[@itemprop]/@itemprop

Social Meta Tags (Open Graph Tags & Twitter Cards)

You may wish to extract social meta tags, such as Facebook Open Graph tags, account details, or Twitter Cards. The XPath is, for example –

//meta[starts-with(@property, 'og:title')]/@content
//meta[starts-with(@property, 'og:description')]/@content
//meta[starts-with(@property, 'og:type')]/@content
//meta[starts-with(@property, 'og:site_name')]/@content
//meta[starts-with(@property, 'og:image')]/@content
//meta[starts-with(@property, 'og:url')]/@content
//meta[starts-with(@property, 'fb:page_id')]/@content
//meta[starts-with(@property, 'fb:admins')]/@content

//meta[starts-with(@property, 'twitter:title')]/@content
//meta[starts-with(@property, 'twitter:description')]/@content
//meta[starts-with(@property, 'twitter:account_id')]/@content
//meta[starts-with(@property, 'twitter:card')]/@content
//meta[starts-with(@property, 'twitter:image:src')]/@content
//meta[starts-with(@property, 'twitter:creator')]/@content


The data extracted –

[Image: social meta tags]
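
To preview which social tags a page exposes, the following Python sketch collects them with the standard library's `html.parser` instead of XPath. The `MetaCollector` class, the `collect_social_meta` helper and the sample markup are hypothetical. It checks both the `property` and `name` attributes, since Twitter Card tags are commonly written with `name`.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags whose property/name starts with a given prefix."""

    def __init__(self, prefixes=("og:", "fb:", "twitter:")):
        super().__init__()
        self.prefixes = prefixes
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("property") or attrs.get("name") or ""
        if key.startswith(self.prefixes) and "content" in attrs:
            # Keep the first occurrence of each tag.
            self.found.setdefault(key, attrs["content"])


def collect_social_meta(html):
    parser = MetaCollector()
    parser.feed(html)
    return parser.found


# Hypothetical page head.
SAMPLE = """<html><head>
<meta property="og:title" content="Web Scraping Guide">
<meta property="og:type" content="article" />
<meta name="twitter:card" content="summary">
<meta name="description" content="Not a social tag">
</head><body></body></html>"""

print(collect_social_meta(SAMPLE))
```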

Mobile Annotations

If you wanted to pull mobile annotations from a website, you might use an XPath such as –

//link[contains(@media, '640') and @href]/@href

Which for the Huffington Post would extract –

[Image: web scraping of mobile annotations]

Email Addresses

Perhaps you wanted to collect email addresses from your website or websites. The XPath might be something like –

//a[starts-with(@href, 'mailto')]

From our website, this would return the two email addresses we have in the footer on every page –

[Image: email extracted]



Iframes

//iframe/@src

The data extracted –

[Image: iframe extracted]

To extract only iframes where a YouTube video is embedded, the XPath would be –

//iframe[contains(@src ,'www.youtube.com/embed/')]

To extract iframes, but exclude a particular iframe URL such as Google Tag Manager URLs, the XPath would be –

//iframe[not(contains(@src, 'https://www.googletagmanager.com/'))]/@src

To extract just the URL of the first iframe found on a page, the XPath would be –

(//iframe/@src)[1]



AMP URLs

//head/link[@rel='amphtml']/@href

The data extracted –

[Image: AMP scraped]

Meta News Keywords

//meta[@name='news_keywords']/@content

The data extracted –

[Image: meta news keyword scraped]

Extract Links In The Body Only

The following XPath will only extract links from the body of a blog post on https://www.screamingfrog.co.uk/annual-screaming-frog-macmillan-morning-bake-off/, where the blog content is contained within the class ‘main-blog--posts_single--inside’.

//div[@class="main-blog--posts_single--inside"]//a – This will get the anchor text with ‘Extract Inner HTML’.
//div[@class="main-blog--posts_single--inside"]//a/@href – This will get the URL with ‘Extract Inner HTML’.
//div[@class="main-blog--posts_single--inside"]//a – This will get the full link code with ‘Extract HTML Element’.

Extract Links Containing Anchor Text

To extract all links with ‘SEO Spider’ in the anchor text:

//a[contains(.,'SEO Spider')]/@href

This matching is case sensitive, so if ‘SEO Spider’ is sometimes ‘seo spider’, you’ll have to do the following:

//a[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'),'seo spider')]/@href

This lower cases all found anchor text, allowing you to compare it against a lower case ‘seo spider’.
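
The `translate()` trick works the same way as Python's `str.translate`, which makes it easy to see what the expression is doing. The sketch below is only an analogy, and the `anchor_contains` helper is hypothetical. Like the XPath version, it lower cases only the 26 unaccented letters listed, so accented characters are left untouched.

```python
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# Character-by-character mapping, mirroring translate(., UPPER, LOWER).
TO_LOWER = str.maketrans(UPPER, LOWER)


def anchor_contains(anchor_text, needle="seo spider"):
    """Case-insensitive 'contains' check on a piece of anchor text."""
    return needle in anchor_text.translate(TO_LOWER)


print(anchor_contains("Download the SEO Spider"))
print(anchor_contains("seo spider guide"))
print(anchor_contains("Log File Analyser"))
```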

Extract Content From Specific Divs

The following XPath will extract content from specific divs or spans, using their class name. You’ll need to replace ‘example’ with your own class name.

//div[@class="example"]
//span[@class="example"]


Extract Multiple Matched Elements

A pipe can be used between expressions in a single extractor to keep related elements next to each other in an export.

The following expression matches blog titles and the number of comments they have on blog archive pages:

//div[contains(@class ,'main-blog--posts_single-inner--text--inner')]//h3|//a[@class="comments-link"]

[Image: Multiple Matched XPath Extraction]

Regex Examples

Google Analytics ID

To extract the Google Analytics ID from a page the expression needed would be –

["'](UA-.*?)["']

[Image: GA UA ID extraction]

The data extracted is –

[Image: GA UA ID scraping]
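
A pattern for this can be tried quickly in Python before it is used in a crawl. The sketch below assumes a pattern of the form `["'](UA-.*?)["']` (a quoted string starting with `UA-`), a made-up tracking snippet and a placeholder ID. The SEO Spider uses Java regex, which behaves the same as Python for a pattern this simple.

```python
import re

# Assumed pattern: a quoted string that starts with "UA-".
GA_ID = re.compile(r"""["'](UA-.*?)["']""")

# Hypothetical inline tracking snippet with a placeholder ID.
SAMPLE = "<script>ga('create', 'UA-1234567-1', 'auto');</script>"


def find_ga_id(html):
    """Return the first UA ID found in the HTML, or None."""
    match = GA_ID.search(html)
    return match.group(1) if match else None


print(find_ga_id(SAMPLE))
```

The brackets in the pattern form a capture group, so only the ID itself is returned, without the surrounding quotes.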


Structured Data

If the structured data is implemented in the JSON-LD format, regular expressions rather than XPath or CSS Selectors must be used:

("product": ".*?")
("ratingValue": ".*?")
("reviewCount": ".*?")

To extract everything in the JSON-LD script tag, you could use –

<script type=\"application\/ld\+json\">(.*?)</script>
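
The JSON-LD patterns above can be checked against a sample page in Python. This is a sketch under assumptions: the markup and values are invented, and `re.DOTALL` is added so that `.` also matches line breaks inside the script tag in Python. Whether an equivalent setting is needed in the SEO Spider is not covered here.

```python
import json
import re

# Hypothetical page with a JSON-LD block spread over several lines.
SAMPLE = """<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget",
 "aggregateRating": {"ratingValue": "4.5", "reviewCount": "89"}}
</script>
</head></html>"""

# Patterns copied from the examples above.
RATING = re.compile(r'("ratingValue": ".*?")')
SCRIPT = re.compile(
    r'<script type=\"application\/ld\+json\">(.*?)</script>', re.DOTALL
)


def extract_json_ld(html):
    """Return the parsed JSON-LD object, or None if no block is found."""
    match = SCRIPT.search(html)
    return json.loads(match.group(1)) if match else None


print(RATING.search(SAMPLE).group(1))
print(extract_json_ld(SAMPLE)["aggregateRating"]["reviewCount"])
```

Note that the key patterns match only when the key is followed by a colon, exactly one space and a quoted value, so they may need adjusting for minified JSON or numeric values.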

Email Addresses

The following will return any alphanumeric string that contains an @ in the middle:

[a-zA-Z0-9_.-]+@[a-zA-Z0-9.-]+

The following expression will bring back fewer false positives, as it requires at least a single period in the second half of the string:

[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+
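
The effect of requiring a period can be seen by running a loose and a strict pattern over the same text. The patterns below are assumptions of the shape described (characters, an @, more characters, and for the strict version a period in the second half), and the sample sentence is invented.

```python
import re

# Loose: anything alphanumeric around an @.
LOOSE = re.compile(r"[a-zA-Z0-9_.-]+@[a-zA-Z0-9.-]+")
# Strict: additionally requires a period after the @.
STRICT = re.compile(r"[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")

# Hypothetical page text with one real address and one false positive.
SAMPLE = "Contact support@example.com or see image logo@2x for retina."

loose_matches = LOOSE.findall(SAMPLE)
strict_matches = STRICT.findall(SAMPLE)

print(loose_matches)   # includes the retina image name
print(strict_matches)  # only the real address
```

The strict pattern still accepts strings such as file names that happen to contain an @ followed by a period, so results are worth reviewing after export.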


That’s it for now, but I’ll add to this list over time with more examples, for each method of extraction.

As always, you can pop us through any questions or queries to our support.
