SEO Spider
Web Scraping & Custom Extraction
Web Scraping & Data Extraction Using The SEO Spider
This tutorial walks you through how to use the Screaming Frog SEO Spider’s custom extraction feature to scrape data from websites.
The custom extraction feature allows you to scrape any data from the HTML of a web page using XPath, CSSPath and regex. You can extract data from the raw HTML, or rendered HTML if JavaScript rendering mode is enabled.
Jump to our video walkthrough or examples via the links below:
To get started, you’ll need to download & install the SEO Spider software and have a licence to access the custom extraction feature necessary for scraping.
When you have the SEO Spider open, the next steps to start extracting data are as follows:
1) Click ‘Configuration > Custom > Custom Extraction’
This menu can be found in the top level menu of the SEO Spider.
This will open up the custom extraction configuration which allows you to configure up to 100 separate ‘extractors’.
2) Add An Extractor
Click ‘Add’ in the bottom right-hand corner to set up an extractor and start scraping data.
3) Input Your Expression
The Screaming Frog SEO Spider allows you to scrape data from websites by using an in-built browser and selecting the element you wish to extract, or setting up extractors manually.
Visual Custom Extraction
To use visual custom extraction, click on the ‘browser’ icon next to the extractor.
This will open our visual custom extraction inbuilt browser. Enter a URL you wish to extract data from in the URL bar.
Next, select the element on the page you wish to scrape.
The SEO Spider will then highlight the area on the page, create a variety of suggested expressions, and preview what will be extracted based upon the raw or rendered HTML. In this case, an author name from a blog post.
If the element only appears in the ‘Rendered HTML Preview’ and not the ‘Source HTML Preview’, then it may well rely on JavaScript. This means you’ll need to use JavaScript rendering mode to scrape the data.
Tip!
To navigate to another page in the visual custom extraction browser, hold down control and click a link.
When using XPath or CSS Path to collect HTML, you can select what to extract using the ‘data’ dropdown –
- Extract HTML Element – The selected element and all of its inner HTML content.
- Extract Inner HTML – The inner HTML content of the selected element. If the selected element contains other HTML elements, they will be included.
- Extract Text – The text content of the selected element and the text content of any sub elements.
- Function Value – The result of the supplied function, eg count(//h1) to find the number of h1 tags on a page.
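As a rough illustration of how these four options differ, here is a small Python sketch using only the standard library’s xml.etree.ElementTree (which supports a subset of XPath); the HTML snippet and element names are invented for the example:

```python
import xml.etree.ElementTree as ET

# A minimal, well-formed snippet to extract from (invented for illustration).
html = '<div><h1>Hello <em>world</em></h1></div>'
root = ET.fromstring(html)
h1 = root.find('.//h1')

# 'Extract HTML Element' - the selected element and all of its inner content.
element = ET.tostring(h1, encoding='unicode')

# 'Extract Inner HTML' - the content inside the element, child tags included.
inner = (h1.text or '') + ''.join(ET.tostring(c, encoding='unicode') for c in h1)

# 'Extract Text' - the text of the element and all of its sub elements.
text = ''.join(h1.itertext())

# 'Function Value' - e.g. count(//h1); ElementTree has no count(), so use len().
count = len(root.findall('.//h1'))

print(element)  # <h1>Hello <em>world</em></h1>
print(inner)    # Hello <em>world</em>
print(text)     # Hello world
print(count)    # 1
```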
In this case, it’s author text, so ‘Extract Text’ has been selected. The extractor ‘name’ field, which corresponds to the column name in the SEO Spider, can also be updated – in this case to ‘Author’.
Click ‘OK’ to set up the extractor and close the visual custom extraction browser, or ‘Add Extractor’ to set up the extractor and keep the browser open to set up another.
If the element isn’t on the page, you can switch to Rendered or Source HTML view and pick a line of HTML instead. For example, if you wish to extract the ‘content’ of a meta property tag in the head of the HTML –
You can then select the attribute you wish to extract from the dropdown, and it will formulate the expression for you.
In this case below, it will scrape the published time, which is shown in the source and rendered HTML previews after selecting the ‘content’ attribute.
That’s the end of this step for those using visual custom extraction.
Manual Custom Extraction
If you’ve mastered XPath, CSSPath and regex, you can input your expressions manually. There are three ways to extract data:
- XPath – XPath is a query language for selecting nodes from an XML-like document, such as HTML. This option allows you to scrape data by using XPath selectors, including attributes.
- CSSPath – In CSS, selectors are patterns used to select elements and are often the quickest out of the three methods available. This option allows you to scrape data by using CSS Path selectors. An optional attribute field is also available.
- Regex – A regular expression is a special string of text used for matching patterns in data. This is best for advanced uses, such as scraping HTML comments or inline JavaScript.
CSSPath or XPath are recommended for most common scenarios, and although both have their advantages, you can simply pick the option you’re most comfortable using. Regex is required for anything that is not part of an HTML element, for example any JSON found in the code.
You can use a browser, such as Chrome, right click an element and choose ‘Inspect’, then copy the XPath or selector provided into the extractor field.
Then rename the ‘extractors’, which correspond to the column names in the SEO Spider.
The ticks next to each extractor confirm the syntax used is valid. A red cross means the syntax is invalid and will need adjusting.
When you’re happy, simply press the ‘OK’ button at the bottom. If you’d like to see more examples, then skip to the bottom of this guide.
Caveats
Copying expressions directly from a browser is not the most robust method for building CSS Selectors and XPath expressions.
The expressions generated this way are often tied to the exact position of the element in the code, and that position can differ between the inspected view (which shows the rendered page, after JavaScript has been processed) and the raw source. By default, the SEO Spider extracts from the HTML source unless JavaScript rendering mode is enabled.
Different browsers can provide different expressions, and various HTML clean-ups can also occur when the SEO Spider processes a page where there is invalid mark-up.
The w3schools guide on CSS Selectors and their XPath introduction are good resources for understanding the basics of these expressions.
4) Crawl The Website
Input the website address into the URL bar and click ‘start’ to crawl the website, and commence scraping.
The progress of the crawl can be seen in the progress bar in the top right, but you don’t have to wait until the crawl has finished to view data.
5) View Scraped Data Under The Custom Extraction Tab
Scraped data starts appearing in real-time during the crawl under the ‘Custom Extraction’ tab, as well as in the ‘Internal’ tab, allowing you to export everything collected together into a spreadsheet.
In the example outlined above, we can see the scraped author names and published dates next to each blog post.
When the progress bar reaches ‘100%’, the crawl has finished and you can choose to ‘export’ the data using the ‘export’ buttons on the tab.
If you already have a list of URLs you wish to extract data from, rather than crawl a website to collect the data, then you can upload them using list mode.
That’s it! Hopefully the above guide helps illustrate how to use the SEO Spider software for web scraping.
Video Walkthrough
Watch our quickfire video tutorial on how to set up custom extraction.
XPath Examples
SEOs love XPath! So below is a list of various elements you can extract, with the XPath required. The SEO Spider currently supports XPath version 1.0.
Jump to a specific XPath extraction example:
Headings
Hreflang
Structured Data
Social Meta Tags (Open Graph Tags & Twitter Cards)
Mobile Annotations
Email Addresses
iframes
AMP URLs
Meta News Keywords
Meta Viewport Tag
Extract Links In The Body Only
Extract Links Containing Anchor Text
Extract Links to a Specific Domain
Extract Content From Specific Divs
Extract Multiple Matched Elements
Headings
By default, the SEO Spider only collects h1s and h2s, but if you’d like to collect h3s, the XPath is –
//h3
The data extracted –
However, you may wish to collect just the first h3, particularly if there are many per page. The XPath is –
/descendant::h3[1]
To collect the first 10 h3s on a page, the XPath would be –
/descendant::h3[position() >= 0 and position() <= 10]
To count the number of h3 tags on a page the expression needed is –
count(//h3)
In this case ‘Extract Inner HTML’ in the far right dropdown of the Custom Extraction Window must be changed to ‘Function Value’ for this expression to work correctly.
The length of any extracted string can also be calculated with XPath using the ‘Function Value’ option. To calculate the length of the first h3 on the page (in XPath 1.0, string-length against a node-set uses the first matching node), the following expression is needed –
string-length(//h3)
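To sanity-check how these two function values behave, here is a minimal Python sketch (standard library only, with invented markup) mirroring count(//h3) and string-length(//h3):

```python
import xml.etree.ElementTree as ET

# Invented snippet with three h3 headings.
html = '<body><h3>One</h3><h3>Two</h3><h3>Three</h3></body>'
root = ET.fromstring(html)
h3s = root.findall('.//h3')

# count(//h3) counts every matching node.
count = len(h3s)

# string-length(//h3) in XPath 1.0 converts the node-set to a string,
# which takes only the *first* node - so this measures the first h3.
first_length = len(''.join(h3s[0].itertext()))

print(count)         # 3
print(first_length)  # 3 (len('One'))
```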
Hreflang
The following XPath, combined with ‘Extract HTML Element’, will collect the contents of all hreflang elements –
//*[@hreflang]
The above will collect the entire HTML element, with the link and hreflang value. The results –
If you want just the hreflang values (like ‘en-GB’), you can specify the attribute using @hreflang –
//*[@hreflang]/@hreflang
The data extracted –
Hreflang analysis functionality is now built into the SEO Spider as standard, for more details please see Hreflang Extraction and Hreflang Tab.
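The same attribute selection can be sketched in Python with the standard library, as a rough stand-in for the two expressions above (the head markup and URLs are invented):

```python
import xml.etree.ElementTree as ET

# Invented head section with alternate hreflang links.
html = ('<head>'
        '<link rel="alternate" hreflang="en-GB" href="https://example.com/"/>'
        '<link rel="alternate" hreflang="de" href="https://example.com/de/"/>'
        '</head>')
root = ET.fromstring(html)

# //*[@hreflang] - every element carrying an hreflang attribute.
elements = root.findall('.//*[@hreflang]')

# //*[@hreflang]/@hreflang - just the attribute values.
values = [el.get('hreflang') for el in elements]
print(values)  # ['en-GB', 'de']
```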
Structured Data
You may wish to collect the types of various Schema on a page, so the set-up might be –
//*[@itemtype]/@itemtype
The data extracted –
For ‘itemprop’ rules, you can use a similar XPath –
//*[@itemprop]/@itemprop
Don’t forget, the SEO Spider can extract and validate structured data without requiring custom extraction.
Social Meta Tags (Open Graph Tags & Twitter Cards)
You may wish to extract social meta tags, such as Facebook Open Graph tags, account details, or Twitter Cards. For example, the XPath expressions are –
//meta[starts-with(@property, 'og:title')]/@content
//meta[starts-with(@property, 'og:description')]/@content
//meta[starts-with(@property, 'og:type')]/@content
//meta[starts-with(@property, 'og:site_name')]/@content
//meta[starts-with(@property, 'og:image')]/@content
//meta[starts-with(@property, 'og:url')]/@content
//meta[starts-with(@property, 'fb:page_id')]/@content
//meta[starts-with(@property, 'fb:admins')]/@content
//meta[starts-with(@name, 'twitter:title')]/@content
//meta[starts-with(@name, 'twitter:description')]/@content
//meta[starts-with(@name, 'twitter:account_id')]/@content
//meta[starts-with(@name, 'twitter:card')]/@content
//meta[starts-with(@name, 'twitter:image:src')]/@content
//meta[starts-with(@name, 'twitter:creator')]/@content
etc.
The data extracted –
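The starts-with() matching used above can be approximated in Python like so (standard library only; ElementTree has no starts-with(), so the filter is done in Python, and the meta tags shown are invented):

```python
import xml.etree.ElementTree as ET

# Invented head snippet with Open Graph and Twitter Card tags.
html = ('<head>'
        '<meta property="og:title" content="Web Scraping Guide"/>'
        '<meta property="og:type" content="article"/>'
        '<meta name="twitter:card" content="summary"/>'
        '</head>')
root = ET.fromstring(html)

# Rough equivalent of //meta[starts-with(@property, 'og:')]/@content.
og_content = [m.get('content') for m in root.findall('.//meta')
              if (m.get('property') or '').startswith('og:')]
print(og_content)  # ['Web Scraping Guide', 'article']
```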
Mobile Annotations
If you wanted to pull mobile annotations from a website, you might use an XPath such as –
//link[contains(@media, '640') and @href]/@href
Which for the Huffington Post would extract –
Email Addresses
To collect email addresses from a website, the XPath might be something like –
//a[starts-with(@href, 'mailto')]
From our website, this would return the two email addresses we have in the footer on every page –
iframes
//iframe/@src
The data extracted –
To extract only iframes where a YouTube video is embedded –
//iframe[contains(@src ,'www.youtube.com/embed/')]
To extract iframes, excluding a particular URL such as Google Tag Manager –
//iframe[not(contains(@src, 'https://www.googletagmanager.com/'))]/@src
Extracting just the URL of the first iframe found on a page would be –
(//iframe/@src)[1]
AMP URLs
//head/link[@rel='amphtml']/@href
The data extracted –
Meta News Keywords
//meta[@name='news_keywords']/@content
The data extracted –
Meta Viewport Tag
//meta[@name='viewport']/@content
The data extracted –
Extract Links In The Body Only
The following XPath will only extract links from the body of a blog post on https://www.screamingfrog.co.uk/annual-screaming-frog-macmillan-morning-bake-off/, where the blog content is contained within the class ‘main-blog--posts_single--inside’.
This will get the anchor text with ‘Extract Inner HTML’:
//div[@class="main-blog--posts_single--inside"]//a
This will get the URL with ‘Extract Inner HTML’:
//div[@class="main-blog--posts_single--inside"]//a/@href
This will get the full link code with ‘Extract HTML Element’:
//div[@class="main-blog--posts_single--inside"]//a
Extract Links Containing Anchor Text
To extract all links with ‘SEO Spider’ in the anchor text:
//a[contains(.,'SEO Spider')]/@href
This matching is case sensitive, so if ‘SEO Spider’ is sometimes ‘seo spider’, you’ll have to do the following:
//a[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'),'seo spider')]/@href
This lower-cases all matched anchor text, allowing you to compare it against a lower-case ‘seo spider’.
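The effect of the translate() trick can be sketched in Python (standard library only; the markup and URLs are invented):

```python
import xml.etree.ElementTree as ET

# Invented snippet with mixed-case anchor text.
html = ('<body>'
        '<a href="/spider">SEO Spider</a>'
        '<a href="/blog">the seo spider tool</a>'
        '<a href="/other">Log File Analyser</a>'
        '</body>')
root = ET.fromstring(html)

# Lower-case each link's text before testing it, so 'SEO Spider'
# and 'seo spider' both match - the same idea as translate() in XPath 1.0.
hrefs = [a.get('href') for a in root.findall('.//a')
         if 'seo spider' in ''.join(a.itertext()).lower()]
print(hrefs)  # ['/spider', '/blog']
```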
Extract Links to a Specific Domain
To extract all the links from a page referencing ‘screamingfrog.co.uk’ you can use:
//a[contains(@href,'screamingfrog.co.uk')]
Using ‘Extract HTML Element’ or ‘Extract Text’ will extract the full link code or just the anchor text respectively.
If you only want to extract the linked URL you can use:
//a[contains(@href,'screamingfrog.co.uk')]/@href
Extract Content From Specific Divs
The following XPath will extract content from specific divs or spans, using their class. You’ll need to replace ‘example’ with your own class name.
//div[@class="example"]
//span[@class="example"]
Extract Multiple Matched Elements
A pipe can be used between expressions in a single extractor to keep related elements next to each other in an export.
The following expression matches blog titles and the number of comments they have on blog archive pages:
//div[contains(@class ,'main-blog--posts_single-inner--text--inner')]//h3|//a[@class="comments-link"]
Regex Examples
Jump to a specific Regex extraction example:
Google Analytics and Tag Manager IDs
Structured Data
Email Addresses
Google Analytics and Tag Manager IDs
To extract the Google Analytics ID from a page, the expression needed would be –
["'](UA-.*?)["']
For Google Tag Manager (GTM) it would be –
["'](GTM-.*?)["']
The data extracted is –
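The same two expressions can be tried in Python’s re module (the page source below is invented for illustration):

```python
import re

# Invented page source containing tracking snippets.
html = "ga('create', 'UA-12345678-1'); gtm.load('GTM-ABC123');"

# Capture whatever sits between quotes after UA- or GTM-.
ua = re.findall(r"[\"'](UA-.*?)[\"']", html)
gtm = re.findall(r"[\"'](GTM-.*?)[\"']", html)

print(ua)   # ['UA-12345678-1']
print(gtm)  # ['GTM-ABC123']
```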
Structured Data
If structured data is implemented in JSON-LD format, regular expressions rather than XPath or CSS Selectors must be used. You may find that some implementations have additional spaces before or after the colon, so the below example might need to be adjusted to match the HTML:
"product": "(.*?)"
"ratingValue": "(.*?)"
"reviewCount": "(.*?)"
To extract everything in the JSON-LD script tag, you could use –
<script type=\"application\/ld\+json\">(.*?)</script>
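Once the raw JSON-LD has been scraped, it can be parsed properly rather than picked apart field by field. A Python sketch (note that in Python regex there is no need to escape the forward slash; the page source here is invented):

```python
import json
import re

# Invented page source with a JSON-LD block.
html = ('<html><head><script type="application/ld+json">'
        '{"@type": "Product", "ratingValue": "4.8", "reviewCount": "12"}'
        '</script></head></html>')

# Grab everything inside the JSON-LD script tag, then parse it as JSON.
match = re.search(r'<script type="application/ld\+json">(.*?)</script>',
                  html, re.DOTALL)
data = json.loads(match.group(1))

print(data['ratingValue'])  # 4.8
print(data['reviewCount'])  # 12
```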
Email Addresses
The following will return any alphanumeric string that contains an @ in the middle:
[a-zA-Z0-9-_.]+@[a-zA-Z0-9-.]+
The following expression will bring back fewer false positives, as it requires at least a single period in the second half of the string:
[a-zA-Z0-9-_.]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
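The difference between the two patterns can be seen in Python’s re module (the addresses below are invented; note the second pattern rejects strings with no dot after the @):

```python
import re

# Invented text with one real-looking address and one false positive.
text = 'Contact sales@example.com or the machine account root@localhost'

# The looser pattern matches anything @-shaped.
loose = re.findall(r'[a-zA-Z0-9-_.]+@[a-zA-Z0-9-.]+', text)

# The stricter pattern requires at least one period after the @.
strict = re.findall(r'[a-zA-Z0-9-_.]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', text)

print(loose)   # ['sales@example.com', 'root@localhost']
print(strict)  # ['sales@example.com']
```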
That’s it for now, but I’ll add to this list over time with more examples, for each method of extraction.
As always, you can send any questions or queries through to our support team.