Web Scraping

Web Scraping & Data Extraction Using The SEO Spider Tool

This tutorial walks you through using the Screaming Frog SEO Spider’s custom extraction feature to scrape data from websites.

The custom extraction feature allows you to scrape any data from the HTML of a web page using CSS Path, XPath and regex. The extraction is performed on the static HTML returned from URLs crawled by the SEO Spider, which return a 200 ‘OK’ response.

To get started, you’ll need to download & install the SEO Spider software and have a licence to access the custom extraction feature necessary for scraping. You can download via the buttons in the right hand side bar.

When you have the SEO Spider open, the next steps to start extracting data are as follows –

1) Click ‘Configuration > Custom > Extraction’

This menu can be found in the top level menu of the SEO Spider.

[Screenshot: custom extraction configuration]

This will open up the custom extraction configuration with 10 separate ‘extractors’, which are set to ‘inactive’ by default.

[Screenshot: custom extraction all extractors]

2) Select CSS Path, XPath or Regex for Scraping

The Screaming Frog SEO Spider tool provides three methods for scraping data from websites:

  1. XPath – XPath is a query language for selecting nodes from an XML-like document, such as HTML. This option allows you to scrape data by using XPath selectors, including attributes.
  2. CSS Path – In CSS, selectors are patterns used to select elements, and they are often the quickest of the three methods available. This option allows you to scrape data by using CSS Path selectors. An optional attribute field is also available.
  3. Regex – A regular expression is a special string of text used for matching patterns in data. This is best for advanced uses, such as scraping HTML comments or inline JavaScript.

CSS Path or XPath are recommended for most common scenarios, and although both have their advantages, you can simply pick the option which you’re most comfortable using.
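As a rough illustration of the regex method, the Python sketch below pulls an HTML comment and a Google Analytics style ID out of a page's source. The sample HTML, the ID value and both patterns are assumptions made up for this example, and the SEO Spider's own regex handling may differ in detail.

```python
import re

# Hypothetical page source; the comment text and tracking ID are made up.
html = """
<html><head>
<!-- template: blog-post-v2 -->
<script>ga('create', 'UA-1234567-1', 'auto');</script>
</head><body><p>Hello</p></body></html>
"""

# Capture the text inside HTML comments (non-greedy, across line breaks).
comments = [c.strip() for c in re.findall(r"<!--(.*?)-->", html, flags=re.DOTALL)]

# Capture anything shaped like a classic Google Analytics property ID.
ga_ids = re.findall(r"UA-\d+-\d+", html)

print(comments)
print(ga_ids)
```

Neither value sits in a regular HTML element, which is why a pattern match is a better fit here than a CSS Path or XPath selector.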

When using XPath or CSS Path to collect HTML, you can choose exactly what to extract using the drop down filters –

  • Extract HTML Element – The selected element and all of its inner HTML content.
  • Extract Inner HTML – The inner HTML content of the selected element. If the selected element contains other HTML elements, they will be included.
  • Extract Text – The text content of the selected element and the text content of any sub elements.
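The difference between the three options can be sketched in Python. This is only an approximation using the standard library's ElementTree on a small, well-formed snippet. The markup and the function names are hypothetical, and the SEO Spider's exact output may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a selected element.
snippet = '<div class="author"><a href="/team/">Jane <b>Doe</b></a></div>'
element = ET.fromstring(snippet)

def extract_html_element(el):
    """The selected element itself, plus everything inside it."""
    return ET.tostring(el, encoding="unicode", method="html")

def extract_inner_html(el):
    """Only what sits inside the selected element, child tags included."""
    inner = el.text or ""
    for child in el:
        inner += ET.tostring(child, encoding="unicode", method="html")
    return inner

def extract_text(el):
    """Text of the element and of all of its sub elements, with no tags."""
    return "".join(el.itertext())

print(extract_html_element(element))
print(extract_inner_html(element))
print(extract_text(element))
```

In this sketch the first call keeps the outer div, the second drops it but keeps the link and bold tags, and the third returns only the text.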

3) Input Your Syntax

Next up, you’ll need to input your syntax into the relevant extractor fields. A quick and easy way to find the CSS Path or XPath of the data you wish to scrape is to open the web page in Chrome, ‘inspect element’ on the HTML line you wish to collect, then right click and copy the selector path provided.

For example, you may wish to start scraping the ‘authors’ of blog posts and the number of comments each has received. Let’s take the Screaming Frog website as the example.

Open up any blog post in Chrome, then right click and ‘inspect element’ on the author’s name, which is located on every post. This will open up the ‘elements’ HTML window. Simply right click again on the relevant HTML line (with the author’s name), copy the relevant CSS path or XPath and paste it into the respective extractor field in the SEO Spider. If you use Firebug in Firefox, then you can do the same there too using FirePath.

[Screenshot: CSS Path scraping author]

You can rename the ‘extractors’, which correspond to the column names in the SEO Spider. In this example, I’ve used CSS Path.

[Screenshot: authors and comments scraped]

The ticks next to each extractor confirm the syntax used is valid. If you have a red cross next to one, the syntax is invalid and will need adjusting.
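To make the idea of named extractors more concrete, here is a minimal Python sketch in which each extractor name maps to a path and becomes a column in the result. The blog markup, class names and paths are hypothetical. ElementTree's limited XPath support stands in for CSS Path, since the Python standard library has no CSS selector engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed blog post markup; real pages will differ.
page = """
<html><body>
  <article>
    <h1>Example post</h1>
    <a class="author" href="/author/jane/">Jane Doe</a>
    <span class="comments-count">12</span>
  </article>
</body></html>
"""

# Extractor name -> path. The names become the column names.
extractors = {
    "Author": ".//a[@class='author']",
    "Comments": ".//span[@class='comments-count']",
}

def run_extractors(source, extractors):
    """Apply every extractor to one page and return a row of results."""
    root = ET.fromstring(source.strip())
    row = {}
    for name, path in extractors.items():
        match = root.find(path)
        # An extractor that matches nothing yields an empty cell.
        row[name] = "".join(match.itertext()).strip() if match is not None else ""
    return row

row = run_extractors(page, extractors)
print(row)
```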

When you’re happy, simply press the ‘OK’ button at the bottom. If you’d like to see more examples, then skip to the bottom of this guide.

4) Crawl The Website

Next, input the website address into the URL field at the top and click ‘start’ to crawl the website, and commence scraping.

[Screenshot: start the crawl to scrape]

5) View Scraped Data Under The Custom Tab & ‘Extraction’ Filter

Scraped data starts appearing in real time during the crawl, under the ‘Custom’ tab and the ‘Extraction’ filter. It also appears under the ‘internal’ tab, allowing you to export everything collected together into Excel.

In the example outlined above, we can see the author names and number of comments next to each blog post, which have been scraped.

[Screenshot: authors and comments extracted]

When the progress bar reaches ‘100%’, the crawl has finished and you can choose to ‘export’ the data using the ‘export’ buttons.
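If you want to post-process scraped data outside the tool, the shape of such an export can be sketched as rows written to CSV, which Excel can open. The column names, URLs and values below are placeholders for illustration only and are not the SEO Spider's actual export format.

```python
import csv
import io

# Hypothetical rows, one per crawled URL; all values are placeholders.
rows = [
    {"Address": "https://example.com/blog/post-1/", "Author": "Jane Doe", "Comments": "12"},
    {"Address": "https://example.com/blog/post-2/", "Author": "John Smith", "Comments": "3"},
]

def to_csv(rows):
    """Serialise a list of row dictionaries to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = to_csv(rows)
print(csv_text)
```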

If you already have a list of URLs you wish to extract data from, rather than crawl a website to collect the data, then you can upload them using list mode.

That’s it! Hopefully the above guide helps illustrate how to use the SEO Spider software for web scraping.

The possibilities are endless. This feature can be used to collect anything from plain text to Google Analytics IDs, schema, social meta tags (such as Open Graph Tags & Twitter Cards), mobile annotations and hreflang values, as well as product prices, discount rates, stock availability etc. I’ve covered some more examples, which are split by the method of extraction.

XPath Examples

SEOs love XPath, so I have put together a very quick list of elements you may wish to extract using XPath.

Headings

//h3

The data extracted –

[Screenshot: h3 tags scraped]

However, you may wish to collect just the first h3, particularly if there are many per page. The XPath is –

/descendant::h3[1]

To collect the first 10 h3s on a page, the XPath would be –

/descendant::h3[position() >= 1 and position() <= 10]
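The effect of these three expressions can be approximated in Python. This is a sketch only: the standard library's ElementTree supports just a subset of XPath and has no descendant axis, so list slicing stands in for the positional predicates. The twelve headings are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page with twelve h3 headings.
items = "".join(f"<h3>Heading {n}</h3>" for n in range(1, 13))
root = ET.fromstring(f"<html><body>{items}</body></html>")

# Counterpart of //h3: every h3 in document order.
all_h3 = [h.text for h in root.findall(".//h3")]

# Counterpart of /descendant::h3[1]: just the first h3 on the page.
first_h3 = all_h3[:1]

# Counterpart of the "first 10" expression: XPath positions start at 1,
# so positions 1 to 10 are the first ten items of the list.
first_ten = all_h3[:10]

print(len(all_h3), first_h3, first_ten[-1])
```

Note that //h3[1] is not the same as /descendant::h3[1]. The former matches the first h3 within each parent element, so it can return several headings per page.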

Hreflang


//*[@hreflang]

The above will collect the entire HTML element, with the link and hreflang value. The results –

[Screenshot: hreflang scraped]

If you want just the hreflang values (like 'en-GB'), you can specify the attribute using @hreflang.


//*[@hreflang]/@hreflang

The data extracted –

[Screenshot: hreflang values extracted]
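The difference between selecting the element and selecting only its attribute can be sketched in Python with ElementTree. The head section, URLs and language codes below are placeholders. ElementTree cannot evaluate the trailing /@hreflang step, so the attribute is read with get() instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed head section; URLs and codes are placeholders.
head = """
<head>
  <link rel="alternate" hreflang="en-GB" href="https://example.com/uk/" />
  <link rel="alternate" hreflang="de-DE" href="https://example.com/de/" />
  <link rel="stylesheet" href="/style.css" />
</head>
"""
root = ET.fromstring(head.strip())

# Like //*[@hreflang]: any element that carries an hreflang attribute.
elements = root.findall(".//*[@hreflang]")

# Like //*[@hreflang]/@hreflang: only the attribute values.
values = [el.get("hreflang") for el in elements]

print([ET.tostring(el, encoding="unicode").strip() for el in elements])
print(values)
```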

Schema


//*[@itemtype]/@itemtype

The data extracted –

[Screenshot: schema scraped]

Social Meta Tags (Open Graph Tags & Twitter Cards)


//meta[starts-with(@property, 'og:title')]/@content
//meta[starts-with(@property, 'og:description')]/@content
//meta[starts-with(@property, 'og:type')]/@content

etc.

The data extracted –

[Screenshot: social meta tags scraped]
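For readers who want to see what the starts-with() test is doing, here is a small Python sketch. ElementTree has no starts-with() function, so the check is done with str.startswith() instead. The meta tags and the helper name meta_content are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed head section with placeholder values.
head = """
<head>
  <meta property="og:title" content="Example title" />
  <meta property="og:description" content="Example description" />
  <meta property="og:type" content="article" />
  <meta name="twitter:card" content="summary" />
</head>
"""
root = ET.fromstring(head.strip())

def meta_content(root, prefix):
    """Content values of meta tags whose property starts with prefix."""
    return [
        meta.get("content", "")
        for meta in root.iter("meta")
        if meta.get("property", "").startswith(prefix)
    ]

print(meta_content(root, "og:title"))
print(meta_content(root, "og:"))
```

Passing a shorter prefix such as "og:" collects every Open Graph value in one go, while the Twitter Card tag is skipped because it uses a name attribute rather than property.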

Mobile Annotations

//link[contains(@media, '640') and @href]/@href

The data extracted –

[Screenshot: mobile annotations scraped]

Email Addresses


//a[starts-with(@href, 'mailto')]

The data extracted –

[Screenshot: email addresses scraped]
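The same mailto test can be sketched in Python with the standard library's HTMLParser, which copes with HTML that is not well-formed XML. The class name and sample markup are made up for illustration. Unlike the XPath above, which selects the whole link element, this sketch goes one step further and keeps only the address.

```python
from html.parser import HTMLParser

class MailtoCollector(HTMLParser):
    """Collects addresses from <a href="mailto:..."> links."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            # Drop the scheme and any ?subject=... query string.
            self.addresses.append(href[len("mailto:"):].split("?")[0])

# Hypothetical snippet with one mailto link and one ordinary link.
html = (
    '<p>Contact <a href="mailto:support@example.com?subject=Hi">support</a> '
    'or <a href="/contact/">use the form</a>.</p>'
)

collector = MailtoCollector()
collector.feed(html)
print(collector.addresses)
```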

iframes

//iframe/@src

The data extracted –

[Screenshot: iframe scraped]

That's it for now, but I'll add to this list over time with more examples, for each method of extraction.

As always, you can send any questions or queries to our support team.
