Web Scraping

Web Scraping & Data Extraction Using The SEO Spider Tool

This tutorial walks you through how you can use the Screaming Frog SEO Spider’s custom extraction feature to scrape data from websites.

The custom extraction feature allows you to scrape any data from the HTML of a web page using CSS Path, XPath or regex. The extraction is performed on the static HTML returned from URLs crawled by the SEO Spider which return a 200 ‘OK’ response. To jump to the examples, click one of the links below:

XPath Examples
Regex Examples

To get started, you’ll need to download & install the SEO Spider software and have a licence to access the custom extraction feature necessary for scraping. You can download via the buttons in the right-hand sidebar.

When you have the SEO Spider open, the next steps to start extracting data are as follows –

1) Click ‘Configuration > Custom > Extraction’

This menu can be found in the top level menu of the SEO Spider.

custom extraction configuration

This will open up the custom extraction configuration with 10 separate ‘extractors’, which are set to ‘inactive’ by default.

custom extraction all extractors

2) Select CSS Path, XPath or Regex for Scraping

The Screaming Frog SEO Spider tool provides three methods for scraping data from websites:

  1. XPath – XPath is a query language for selecting nodes from an XML-like document, such as HTML. This option allows you to scrape data by using XPath selectors, including attributes.
  2. CSS Path – In CSS, selectors are patterns used to select elements and are often the quickest out of the three methods available. This option allows you to scrape data by using CSS Path selectors. An optional attribute field is also available.
  3. Regex – A regular expression is a special string of text used for matching patterns in data. This is best for advanced uses, such as scraping HTML comments or inline JavaScript.

CSS Path or XPath are recommended for most common scenarios, and although both have their advantages, you can simply pick the option which you’re most comfortable using.

When using XPath or CSS Path to collect HTML, you can choose exactly what to extract using the drop down filters –

  • Extract HTML Element – The selected element and all of its inner HTML content.
  • Extract Inner HTML – The inner HTML content of the selected element. If the selected element contains other HTML elements, they will be included.
  • Extract Text – The text content of the selected element and the text content of any sub elements.

3) Input Your Syntax

Next up, you’ll need to input your syntax into the relevant extractor fields. A quick and easy way to find the relevant CSS Path or XPath of the data you wish to scrape is to open up the web page in Chrome, ‘inspect element’ on the HTML line you wish to collect, then right click and copy the relevant selector path provided.

For example, you may wish to start scraping the ‘authors’ of blog posts, and the number of comments each has received. Let’s take the Screaming Frog website as the example.

Open up any blog post in Chrome, right click and ‘inspect element’ on the author’s name, which is located on every post; this will open up the ‘elements’ HTML window. Simply right click again on the relevant HTML line (with the author’s name), copy the relevant CSS Path or XPath and paste it into the respective extractor field in the SEO Spider. If you use Firebug in Firefox, then you can do the same there too using FirePath.

CSS Path Scraping author

You can rename the ‘extractors’, which correspond to the column names in the SEO Spider. In this example, I’ve used CSS Path.

authors and comments scraped

The ticks next to each extractor confirm the syntax used is valid. A red cross means the syntax is invalid and will need adjusting.

When you’re happy, simply press the ‘OK’ button at the bottom. If you’d like to see more examples, then skip to the bottom of this guide.

Please note – This is not the most robust method for building CSS Selectors and XPath expressions. The expressions given using this method can be very specific to the exact position of the element in the code, and that position can change: the inspected view is the rendered version of the page / DOM, while by default the SEO Spider looks at the HTML source, and HTML clean-up can also occur when the SEO Spider processes a page with invalid mark-up.

These can also differ between browsers, e.g. for the above ‘author’ example the following CSS Selectors are given –

Chrome: body > div.main-blog.clearfix > div > div.main-blog--posts > div.main-blog--posts_single--inside_author.clearfix.drop > div.main-blog--posts_single--inside_author-details.col-13-16 > div.author-details--social > a
Firefox: .author-details--social > a:nth-child(1)
Firefox using Firebug and FirePath extensions: .author-details--social>a

The expressions given by Firefox with Firebug and FirePath extensions are generally more robust than those provided by Chrome, and slightly more so than default Firefox. Even so, this should not be used as a complete replacement for understanding the various extraction options and being able to build these manually by examining the HTML source.

The w3schools guide on CSS Selectors and their XPath introduction are good resources for understanding the basics of these expressions.

4) Crawl The Website

Next, input the website address into the URL field at the top and click ‘start’ to crawl the website, and commence scraping.

start the crawl to scrape!

5) View Scraped Data Under The Custom Tab & ‘Extraction’ Filter

Scraped data starts appearing in real time during the crawl, under the ‘Custom’ tab and the ‘Extraction’ filter, as well as the ‘Internal’ tab, which allows you to export everything collected together into Excel.

In the example outlined above, we can see the author names and number of comments next to each blog post, which have been scraped.

authors and comments extracted

When the progress bar reaches ‘100%’, the crawl has finished and you can choose to ‘export’ the data using the ‘export’ buttons.

If you already have a list of URLs you wish to extract data from, rather than crawl a website to collect the data, then you can upload them using list mode.

That’s it! Hopefully the above guide helps illustrate how to use the SEO Spider software for web scraping.

Obviously the possibilities are endless. This feature can be used to collect anything from plain text, to Google Analytics IDs, schema, social meta tags (such as Open Graph Tags & Twitter Cards), mobile annotations, hreflang values, prices of products, discount rates, stock availability and so on. I’ve covered some more examples, which are split by the method of extraction.

XPath Examples

SEOs love XPath. So I have put together a very quick list of elements you may wish to extract, using XPath.

Headings

By default, the SEO Spider only collects h1s and h2s, but if you’d like to collect h3s, the XPath is –

//h3

The data extracted –

h3 tags scraped

However, you may wish to collect just the first h3, particularly if there are many per page. The XPath is –

/descendant::h3[1]

To collect the first 10 h3s on a page, the XPath would be –

/descendant::h3[position() >= 0 and position() <= 10]
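A quick way to sanity-check these heading expressions before pasting them into the SEO Spider is to evaluate them locally with Python’s lxml library against some sample HTML (the headings below are hypothetical):

```python
from lxml import html

# Hypothetical page with three h3 headings
page = html.fromstring("""
<html><body>
  <h3>First heading</h3>
  <h3>Second heading</h3>
  <h3>Third heading</h3>
</body></html>
""")

# //h3 collects every h3 on the page
all_h3 = [h.text for h in page.xpath('//h3')]

# /descendant::h3[1] collects only the first h3 in the document
first_h3 = [h.text for h in page.xpath('/descendant::h3[1]')]

# The 'first 10' expression returns all three here, as only three exist
first_ten = [h.text for h in page.xpath('/descendant::h3[position() >= 0 and position() <= 10]')]
```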

Hreflang

The following XPath, combined with Extract HTML Element, will collect the contents of all hreflang elements –

//*[@hreflang]

The above will collect the entire HTML element, with the link and hreflang value. The results –

hreflang scraped

If you wanted just the hreflang values (like ‘en-GB’), you can specify the attribute using @hreflang.

//*[@hreflang]/@hreflang

The data extracted –

hreflang values extracted
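The difference between the two hreflang expressions can be checked locally with lxml (the URLs below are hypothetical):

```python
from lxml import html

# Hypothetical page with two hreflang alternates
page = html.fromstring("""
<html><head>
  <link rel="alternate" hreflang="en-GB" href="https://example.com/uk/" />
  <link rel="alternate" hreflang="de" href="https://example.com/de/" />
</head></html>
""")

# //*[@hreflang] selects the whole elements (link + hreflang value)
elements = page.xpath('//*[@hreflang]')

# //*[@hreflang]/@hreflang returns just the attribute values
values = page.xpath('//*[@hreflang]/@hreflang')
```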

Schema

You may wish to collect the types of various Schema on a page, so the set-up might be –

//*[@itemtype]/@itemtype

The data extracted –

schema scraped

Hreflang extraction functionality is now built into the Spider, for more details please see Hreflang Extraction and Hreflang Tab.

Social Meta Tags (Open Graph Tags & Twitter Cards)

You may wish to extract social meta tags, such as Facebook Open Graph tags or Twitter Cards. For example, the XPath is –

//meta[starts-with(@property, 'og:title')]/@content
//meta[starts-with(@property, 'og:description')]/@content
//meta[starts-with(@property, 'og:type')]/@content

etc.

The data extracted –

social meta tags scraped
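These Open Graph expressions can also be verified locally with lxml (hypothetical tag values):

```python
from lxml import html

# Hypothetical page head with Open Graph tags
page = html.fromstring("""
<html><head>
  <meta property="og:title" content="Example Title" />
  <meta property="og:description" content="An example description." />
  <meta property="og:type" content="article" />
</head></html>
""")

# starts-with(@property, 'og:title') matches the og:title tag;
# /@content returns just the content attribute value
og_title = page.xpath("//meta[starts-with(@property, 'og:title')]/@content")
og_type = page.xpath("//meta[starts-with(@property, 'og:type')]/@content")
```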

Mobile Annotations

If you wanted to pull mobile annotations from a website, you might use an XPath such as –

//link[contains(@media, '640') and @href]/@href

Which for the Huffington Post would extract –

mobile annotations scraped

Email Addresses

If you wanted to collect email addresses from your website or other websites, the XPath might be something like –

//a[starts-with(@href, 'mailto')]

From our website, this would return the two email addresses we have in the footer on every page –

email addresses scraped
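As a local sketch with lxml (the footer and email address below are hypothetical), appending /@href to the same expression returns the mailto URL itself rather than the whole link element:

```python
from lxml import html

# Hypothetical footer containing an email link and an ordinary link
page = html.fromstring("""
<html><body>
  <footer>
    <a href="mailto:support@example.com">Email support</a>
    <a href="/contact/">Contact form</a>
  </footer>
</body></html>
""")

# Only links whose href starts with 'mailto' are matched
email_links = page.xpath("//a[starts-with(@href, 'mailto')]/@href")
```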

iframes

//iframe/@src

The data extracted –

iframe scraped

To only extract iframes where a YouTube video is embedded, the XPath would be –

//iframe[contains(@src ,'www.youtube.com/embed/')]

Extracting just the URL of the first iframe found on a page would be –

(//iframe/@src)[1]
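The three iframe expressions behave as follows when evaluated locally with lxml (the embed URLs are hypothetical; note that /@src is appended to the YouTube filter here so the demo returns the URL rather than the whole element):

```python
from lxml import html

# Hypothetical page with a YouTube embed and another iframe
page = html.fromstring("""
<html><body>
  <iframe src="https://www.youtube.com/embed/abc123"></iframe>
  <iframe src="https://example.com/other-widget"></iframe>
</body></html>
""")

# //iframe/@src collects every iframe URL
all_iframes = page.xpath('//iframe/@src')

# contains(@src, ...) filters to YouTube embeds only
youtube_only = page.xpath("//iframe[contains(@src, 'www.youtube.com/embed/')]/@src")

# (//iframe/@src)[1] returns just the first iframe URL on the page
first_only = page.xpath('(//iframe/@src)[1]')
```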

AMP URLs

//head/link[@rel='amphtml']/@href

The data extracted –

AMP URLs custom extraction

Meta News Keywords

//meta[@name='news_keywords']/@content

The data extracted –

meta news keywords scraped

That’s it for now, but I’ll add to this list over time with more examples for each method of extraction.

As always, you can send any questions or queries through to our support.

Extract Links In The Body Only

The following XPath will only extract links from the body of a blog post on https://www.screamingfrog.co.uk/annual-screaming-frog-macmillan-morning-bake-off/, where the blog content is contained within the class ‘main-blog--posts_single--inside’.

//div[@class="main-blog--posts_single--inside"]//a – This will get the anchor text with ‘Extract Inner HTML’.
//div[@class="main-blog--posts_single--inside"]//a/@href – This will get the URL with ‘Extract Inner HTML’.
//div[@class="main-blog--posts_single--inside"]//a – This will get the full link code with ‘Extract HTML Element’.
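The three variants above can be sketched locally with lxml (the page structure below is a hypothetical simplification of the blog template):

```python
from lxml import html

# Hypothetical blog page: one link inside the post body, one in the footer
page = html.fromstring("""
<html><body>
  <div class="main-blog--posts_single--inside">
    <p>Read the <a href="/guide/">full guide</a> here.</p>
  </div>
  <footer><a href="/contact/">Contact</a></footer>
</body></html>
""")

# Anchor text of links inside the post body only (footer link excluded)
anchor_text = [a.text_content() for a in
               page.xpath('//div[@class="main-blog--posts_single--inside"]//a')]

# URLs of links inside the post body only
urls = page.xpath('//div[@class="main-blog--posts_single--inside"]//a/@href')
```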

Regex Examples

Google Analytics ID

To extract the Google Analytics ID from a page the expression needed would be –

["'](UA-.*?)["']

Google Analytics ID Extraction

The data extracted is –

google-analytics-id-extracted
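The same expression can be tested with Python’s built-in re module (the tracking snippet and UA ID below are hypothetical):

```python
import re

# Hypothetical inline Google Analytics snippet from a page's HTML source
html_source = """
<script>
  ga('create', 'UA-12345678-1', 'auto');
</script>
"""

# ["'](UA-.*?)["'] captures the ID between single or double quotes
match = re.search(r"""["'](UA-.*?)["']""", html_source)
ga_id = match.group(1) if match else None
```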

Schema

If the structured data is implemented in the JSON-LD format, regular expressions must be used rather than XPath or CSS Selectors:

("product": ".*?")
("ratingValue": ".*?")
("reviewCount": ".*?")

