Posted 13 February, 2025 by Ziggy Shtrosberg in Screaming Frog SEO Spider, SEO, AI

This tutorial shows how to leverage the Screaming Frog SEO Spider’s recently released custom JavaScript functionality to create JSON-LD schema markup at scale. With a small amount of code, I will demonstrate how to extract elements from a webpage and integrate each one into a structured data script with variables.

For those of you who might not know JavaScript or be terrified of any code, you can use your favourite LLM, such as ChatGPT, to assist you. I would personally recommend Mike King’s Kermit—a custom GPT designed to specifically help you with Screaming Frog SEO Spider JS snippets!

At the end of this tutorial, my aim is for you to use this guide to automate the creation of structured data with the Screaming Frog SEO Spider — whether it’s for tens, hundreds, or thousands of webpages!

This article is a guest contribution from Ziggy Shtrosberg, a technical SEO specialist.

What Are Custom JavaScript Snippets?

In May 2024, the Screaming Frog SEO Spider introduced custom JavaScript snippets in its version 20.0 update. According to the release notes, JS snippets allow you “to manipulate pages or extract data, as well as communicate with APIs such as OpenAI’s ChatGPT, local LLMs, or other libraries.”

The custom JS Snippet settings come prepacked with a few out-of-the-box snippets, such as using AI to generate image alt text, querying ChatGPT, extracting embeddings and more.

With some creative thinking, you can use the capabilities of the JS snippet functionality to transform standard crawl outputs into powerful tools for diverse SEO challenges, thereby making website crawls significantly more effective.

Why Create Structured Data at Scale With Screaming Frog?

Structured data is machine-readable markup that describes a webpage to a search engine. It can influence how search engines interpret and display your content. Implementing JSON-LD schema markup can lead to rich snippets in the SERPs, enhance a knowledge graph, improve visibility, and potentially increase click-through rates.

The preferable option is to always create schema markup programmatically, either with pre-built CMS functionality, a plugin, or the support of a developer.

Yet, there are instances when you don’t have access to a programmatic option. This is when Screaming Frog’s JS snippets provide a practical alternative, eliminating the need to rely on online schema markup generators or to write the code manually for each page, both of which can be time-consuming and inefficient.

Selecting the Appropriate Schema Type and Page Template

Before we begin, we need to decide on what structured data script we’ll need. For the purpose of this article, I’ve used a fictional scenario in which I’ll populate each of the Search Engine Journal’s (SEJ) blog posts with “article” schema.

You’ll likely need a different script for each page template of your website. For example, an e-commerce site might require different schemas for its homepage, product pages, and blog posts.

Additionally, you’ll need to figure out what elements are available on the webpage, which you can extract from the text or HTML. That will determine what is possible with the schema markup. For example, if your blog article doesn’t reference an author, it would be pointless to add an “author” schema type.

The JSON-LD Schema Markup Template

As mentioned before, in this fictional example, I’ll be creating a JSON-LD script based on an “article” schema type, which will also include “person”, “organization”, and “website” types. With a single crawl, the script will allow us to produce a valid schema for each of SEJ’s blog posts.

I’ll use the “Getting Started In International SEO: A Quick Reference Guide” article as my guinea pig test page.

Here is what my final JSON-LD output will look like for this webpage:

<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Article",
  "url": "https://www.searchenginejournal.com/getting-started-in-international-seo-a-quick-reference-guide/529763/",
  "@id": "https://www.searchenginejournal.com/getting-started-in-international-seo-a-quick-reference-guide/529763/#article",
  "headline": "Getting Started In International SEO: A Quick Reference Guide",
  "description": "Expand your reach with international SEO. This guide explores the unique challenges and strategies for succeeding in global markets.",
  "datePublished": "2024-11-11T10:00:43+00:00",
  "wordCount": 2060,
  "timeRequired": "PT10M",
  "image": {
    "@type": "ImageObject",
    "url": "https://www.searchenginejournal.com/wp-content/uploads/2024/10/international-seo-794.png",
    "height": "840",
    "width": "1600"
  },
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [
      "h1",
      ".sej-article entrycontent"
    ]
  },
  "author": {
    "@type": "Person",
    "name": "Motoko Hunt",
    "url": "https://www.searchenginejournal.com/author/motoko-hunt/",
    "@id": "https://www.searchenginejournal.com/author/motoko-hunt//#person",
    "jobTitle": "President, International Search Marketing",
    "worksFor": {
      "@type": "Organization",
      "name": "AJPR",
      "url": "https://www.ajpr.com/"
    },
    "sameAs": [
      "https://www.searchenginejournal.com/author/motoko-hunt/feed/",
      "https://twitter.com/motokohunt",
      "https://www.linkedin.com/in/japaneseseo/"
    ]
  },
  "publisher": {
    "@type": "Organization",
    "@id": "https://www.searchenginejournal.com/#organization",
    "url": "https://www.searchenginejournal.com",
    "sameAs": [
      "https://twitter.com/sejournal",
      "https://www.facebook.com/SearchEngineJournal",
      "https://www.linkedin.com/company/search-engine-journal",
      "https://www.youtube.com/c/searchenginejournal",
      "https://www.reddit.com/user/SearchEngineJournal",
      "https://www.google.com/search?kgmid=/m/011sh7hw",
      "http://www.pinterest.com/sejournal/"
    ],
    "name": "Search Engine Journal",
    "logo": [
      {
        "@type": "ImageObject",
        "@id": "https://www.searchenginejournal.com/#logo",
        "inLanguage": "en-US",
        "url": "https://www.searchenginejournal.com/wp-content/themes/sej/images/schema/compact.png",
        "width": 1000,
        "height": 1000,
        "caption": "Search Engine Journal"
      }
    ],
    "foundingDate": "2003",
    "slogan": "In a world ruled by algorithms, SEJ brings timely, relevant information for SEOs, marketers, and entrepreneurs to optimize and grow their businesses -- and careers.",
    "description": "Search Engine Journal is dedicated to producing the latest news, the best guides and how-tos for the SEO and marketer community.",
    "legalName": "Search Engine Journal",
    "alternateName": "SEJ"
  },
  "isPartOf": {
    "@type": "WebSite",
    "name": "Search Engine Journal",
    "@id": "https://www.searchenginejournal.com/#website",
    "url": "https://www.searchenginejournal.com/",
    "mainEntity": {
      "@id": "https://www.searchenginejournal.com/getting-started-in-international-seo-a-quick-reference-guide/529763/#article"
    }
  }
}
</script>

The script uses the “article” schema type. First, I will nest the “person” schema type inside the “author” property. Second, the “organization” will be nested inside the “publisher” property, and third, the “website” schema type will be nested inside the “isPartOf” property.

The Easier AI Editing Option

To successfully implement this approach, you will need some basic understanding of crawler web scraping and custom extraction functionality.

You can easily copy-paste the JS snippet code below into your favourite LLM or Mike King’s Kermit (the Screaming Frog SEO Spider JS snippet GPT) as a reference and starting point. You must create your own JSON-LD script and spell out to the LLM which elements you want to extract from the webpage. However, the heavy lifting should be done for you, and using AI for this editing is highly efficient and time-effective.

For those who want a better understanding of the code and the purpose of each section, you can deep-dive into the information below.

The JS Snippet Code

As the Screaming Frog SEO Spider crawls each page, I will use JavaScript code to extract various elements from the page and add them to the JSON-LD structured data as variables.

Here is a reference of the full JS Snippet code:

// Extract the URL of the webpage
let url = window.location.href;
let urlId = `${url}#article`;

// Extract the first H1 tag
let h1 = document.querySelector('h1') ? document.querySelector('h1').textContent.trim() : '';

// Extract the meta description
let metaDescription = document.querySelector('meta[name="description"]') ? document.querySelector('meta[name="description"]').content.trim() : 'No Meta Description Found';

// Extract the published date from the datetime attribute in the <time> tag within .sej-auth-t
let datePublished = 'No Date Found';
let dateElement = document.querySelector('.sej-auth-t time');
if (dateElement) {
  datePublished = dateElement.getAttribute('datetime');
}

// Select the element containing the author information
let authorElement = document.querySelector('.dark-link.sej-auth-h');

// Initialize author details with empty strings
let authorName = '';
let authorUrl = '';
let authorId = '';
let sameAsLinks = '';

// Check if the author element exists, then extract the name and URL
if (authorElement) {
  authorName = authorElement.textContent.trim(); // Extract and clean up author name
  authorUrl = authorElement.href; // Extract author URL
  authorId = `${authorUrl}/#person`; // Construct @id by appending /#person

  // Extract all social media URLs within the .sej-asocial class for the sameAs array
  sameAsLinks = Array.from(document.querySelectorAll('.sej-asocial li a')).map(link => link.href);
}

// Extract the image URL for the author
let imageUrl = '';
let imageElement = document.querySelector('.avatar.img-circle');
if (imageElement) {
  imageUrl = imageElement.src;
}

// Count the total number of words on the page
let wordCount = document.body.innerText.split(/\s+/).length;

// Extract the reading time and convert to ISO 8601 duration
let timeRequired = '';
let readingTimeElement = document.querySelector('.sej-auth-t li:nth-child(3)');
if (readingTimeElement) {
  let readingTimeText = readingTimeElement.textContent.trim();
  let timeMatch = readingTimeText.match(/(\d+)\s*min/);
  if (timeMatch) {
    let minutes = parseInt(timeMatch[1]);
    timeRequired = `PT${minutes}M`; // Convert to ISO 8601 duration format
  }
}

// Extract hero image URL, width, and height directly from the hero image element
let heroImageUrl = '';
let heroImageWidth = '';
let heroImageHeight = '';
let heroImageElement = document.querySelector('.attachment-full.size-full.wp-post-image');
if (heroImageElement) {
  heroImageUrl = heroImageElement.src;

  // Try to get width and height from attributes first
  heroImageWidth = heroImageElement.getAttribute('width');
  heroImageHeight = heroImageElement.getAttribute('height');

  // If width or height are missing, parse the largest values from srcset
  if (!heroImageWidth || !heroImageHeight) {
    let srcset = heroImageElement.getAttribute('srcset');
    if (srcset) {
      let largestImage = srcset.split(',').map(entry => {
        let [url, size] = entry.trim().split(' ');
        return { url, size: parseInt(size) };
      }).sort((a, b) => b.size - a.size)[0];

      // Set hero image URL, width, and height based on largest image in srcset
      if (largestImage) {
        heroImageUrl = largestImage.url;
        heroImageWidth = largestImage.size;
        heroImageHeight = Math.round((heroImageWidth / 1600) * 840); // Adjust height based on aspect ratio if needed
      }
    }
  }
}

// Create the full JSON-LD object
let jsonLd = {
  "@context": "https://schema.org/",
  "@type": "Article",
  "url": url,
  "@id": urlId,
  "headline": h1,
  "description": metaDescription,
  "datePublished": datePublished,
  "wordCount": wordCount,
  "timeRequired": timeRequired,
  "image": {
    "@type": "ImageObject",
    "url": heroImageUrl,
    "height": heroImageHeight,
    "width": heroImageWidth
  },
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [
      "h1",
      ".sej-article entrycontent"
    ]
  },
  "author": {
    "@type": "Person",
    "name": authorName,
    "url": authorUrl,
    "@id": authorId,
    "sameAs": sameAsLinks
  },
  "publisher": {
    "@type": "Organization",
    "@id": "https://www.searchenginejournal.com/#organization",
    "url": "https://www.searchenginejournal.com",
    "sameAs": [
      "https://twitter.com/sejournal",
      "https://www.facebook.com/SearchEngineJournal",
      "https://www.linkedin.com/company/search-engine-journal",
      "https://www.youtube.com/c/searchenginejournal",
      "https://www.reddit.com/user/SearchEngineJournal",
      "https://www.google.com/search?kgmid=/m/011sh7hw",
      "http://www.pinterest.com/sejournal/"
    ],
    "name": "Search Engine Journal",
    "logo": [
      {
        "@type": "ImageObject",
        "@id": "https://www.searchenginejournal.com/#logo",
        "inLanguage": "en-US",
        "url": "https://www.searchenginejournal.com/wp-content/themes/sej/images/schema/compact.png",
        "width": 1000,
        "height": 1000,
        "caption": "Search Engine Journal"
      }
    ],
    "foundingDate": "2003",
    "slogan": "In a world ruled by algorithms, SEJ brings timely, relevant information for SEOs, marketers, and entrepreneurs to optimize and grow their businesses -- and careers.",
    "description": "Search Engine Journal is dedicated to producing the latest news, the best guides and how-tos for the SEO and marketer community.",
    "legalName": "Search Engine Journal",
    "alternateName": "SEJ"
  },
  "isPartOf": {
    "@type": "WebSite",
    "name": "Search Engine Journal",
    "@id": "https://www.searchenginejournal.com/#website",
    "url": "https://www.searchenginejournal.com/",
    "mainEntity": {
      "@id": urlId
    }
  }
};

// Beautify JSON-LD
let beautifiedJsonLd = JSON.stringify(jsonLd, null, 2);

// Manually wrap the JSON-LD with <script> tags
let scriptTagWrappedJsonLd = `<script type="application/ld+json">
${beautifiedJsonLd}
</script>`;

// Return the wrapped JSON-LD with <script> tags
return seoSpider.data(scriptTagWrappedJsonLd);
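The last step of the snippet wraps the stringified JSON-LD in <script> tags by hand. One thing to watch for is that JSON.stringify does not escape the < character, so a headline or description containing the text </script> could close the tag early once the markup is pasted into a page. The sketch below is one possible way to guard against that; wrapJsonLd is a hypothetical helper name and is not part of the original snippet.

```javascript
// Hypothetical helper (an assumption, not part of the original snippet):
// serialise a JSON-LD object and wrap it in a <script> tag.
function wrapJsonLd(jsonLd) {
  const json = JSON.stringify(jsonLd, null, 2)
    // Escape "<" so text such as "</script>" inside a value cannot
    // terminate the surrounding tag. JSON parsers read \u003c back as "<".
    .replace(/</g, '\\u003c');
  return `<script type="application/ld+json">\n${json}\n</script>`;
}

// Example with made-up data
const wrapped = wrapJsonLd({
  '@context': 'https://schema.org/',
  '@type': 'Article',
  headline: 'Why </script> in a title is risky',
});
console.log(wrapped);
```

Inside the SEO Spider, the final line of the snippet would then become return seoSpider.data(wrapJsonLd(jsonLd));.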

The JavaScript Code Explained

Now, let’s break down each bit of the code to explain how each extracted element gets dynamically added as a variable to the JSON-LD script.

Webpage URL and @id

To extract the webpage URL and @id, I’ve used the following JavaScript code:

// Extract the URL of the webpage

let url = window.location.href;

let urlId = `${url}#article`;

The URL is extracted using the window.location.href, which retrieves the full URL of the current webpage.

The same url variable is used to populate the @id, which provides a unique identifier for the “article” schema type. However, this time, I’ve used a template literal (`${url}#article`) to append #article to the URL. For a page at https://example.com/page, for example, urlId will be https://example.com/page#article.
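If a crawled URL ever carries its own fragment (for example #comments), simple concatenation would produce two # characters in the @id. As a minimal sketch, assuming you want any existing fragment replaced, the standard URL class can set the fragment instead; buildArticleId is a hypothetical helper name.

```javascript
// Hypothetical helper (an assumption, not from the original snippet):
// build the @id from a page URL, replacing any existing fragment.
function buildArticleId(pageUrl) {
  const parsed = new URL(pageUrl); // throws a TypeError on an invalid URL
  parsed.hash = 'article';         // replaces any existing "#..." fragment
  return parsed.href;
}

console.log(buildArticleId('https://example.com/page'));
// → https://example.com/page#article
console.log(buildArticleId('https://example.com/page#comments'));
// → https://example.com/page#article
```

In the snippet, this would be called as buildArticleId(window.location.href).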

The url and urlId variables then get populated into my JSON-LD schema as follows:

"url": url,

"@id": urlId,

The Headline

To populate the “headline” of the article, I target the article’s title.

I use the following code to extract the H1 tag:

// Extract the first H1 tag

let h1 = document.querySelector('h1') ? document.querySelector('h1').textContent.trim() : '""';

document.querySelector('h1') looks for the page’s first <h1> element. If it finds one, the ternary reads that element’s text; if not, querySelector returns null and the variable falls back to the placeholder string '""' (a pair of quotation marks).

The .textContent property extracts the text inside the <h1>, and .trim() removes any leading or trailing whitespace around that text.
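The same "element or fallback" pattern repeats for several fields, and the ternary above runs querySelector twice. As a sketch only, the logic could be pulled into a small function; textOrFallback is a hypothetical name, and plain objects stand in for DOM elements here so the example runs outside a browser.

```javascript
// Hypothetical helper (an assumption, not from the original snippet):
// return an element's trimmed text, or a fallback when the element is missing.
function textOrFallback(element, fallback = '') {
  if (!element || typeof element.textContent !== 'string') {
    return fallback;
  }
  return element.textContent.trim();
}

// Plain objects stand in for DOM elements in this demonstration.
const fakeH1 = { textContent: '  Getting Started In International SEO  ' };
console.log(textOrFallback(fakeH1));              // → Getting Started In International SEO
console.log(textOrFallback(null, 'No H1 Found')); // → No H1 Found
```

During a crawl, the call would be textOrFallback(document.querySelector('h1')), which looks up the element once.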

In the schema markup script, the headline is populated using this variable:

"headline": h1,

The Meta Description

I used the following JavaScript code to extract the meta description:

// Extract the meta description

let metaDescription = document.querySelector('meta[name="description"]') ? document.querySelector('meta[name="description"]').content.trim() : '""';

document.querySelector('meta[name="description"]') looks for a <meta> tag with name="description" in the HTML. If it finds this tag, the ternary reads its content attribute; if not, the variable falls back to the placeholder string '""'. The .trim() removes any leading or trailing whitespace.
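Placeholder strings end up in the generated markup whenever an element is missing. One alternative, offered here as an assumption rather than the approach used in this tutorial, is to fall back to empty strings and strip empty properties before serialising. The sketch below uses a hypothetical pruneEmpty function and made-up data.

```javascript
// Hypothetical helper (an assumption, not from the original snippet):
// recursively drop empty strings, null, undefined, empty arrays, and
// objects that are left with nothing except an "@type".
function pruneEmpty(value) {
  if (Array.isArray(value)) {
    const items = value.map(pruneEmpty).filter((item) => item !== undefined);
    return items.length > 0 ? items : undefined;
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.entries(value)
      .map(([key, child]) => [key, pruneEmpty(child)])
      .filter(([, child]) => child !== undefined);
    // Keep the object only if something other than "@type" survived.
    const hasData = entries.some(([key]) => key !== '@type');
    return hasData ? Object.fromEntries(entries) : undefined;
  }
  if (value === null || value === undefined || value === '') {
    return undefined;
  }
  return value;
}

// Example with made-up data: the description and image were not found.
const pruned = pruneEmpty({
  '@type': 'Article',
  headline: 'Title',
  description: '',
  image: { '@type': 'ImageObject', url: '' },
  author: { '@type': 'Person', name: 'A. Writer', sameAs: [] },
});
console.log(JSON.stringify(pruned, null, 2));
```

The result keeps the headline and the author name and omits the description, image, and sameAs properties entirely.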

In the schema markup script, the meta description value is populated using this variable:

"description": metaDescription,

The Date Published Value

The date published value must follow the ISO 8601 datetime format for the markup to validate. The webpage displays the publish date as “November 11, 2024”, but thankfully, the <time> tag in the page’s HTML carries a datetime attribute in the correct ISO 8601 format.

To extract the datetime value for the datePublished variable, I use the code below:

// Extract the published date from the datetime attribute in the <time> tag within .sej-auth-t
let datePublished = '""';
let dateElement = document.querySelector('.sej-auth-t time');
if (dateElement) {
  datePublished = dateElement.getAttribute('datetime');
}

This code extracts a datetime attribute value from a <time> tag within an element with the class .sej-auth-t.

If no matching <time> element is found, datePublished keeps its placeholder value of '""'. If the element exists but has no datetime attribute, getAttribute returns null.
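Because the extracted value may be a placeholder, null, or a human-readable date, it can be worth checking it before it goes into the markup. The following is a loose sketch, not a full ISO 8601 validator; ISO_8601_PATTERN and isIsoDateTime are hypothetical names, and the pattern only covers the common date and date-time shapes.

```javascript
// Hypothetical helper (an assumption, not from the original snippet):
// loose check for an ISO 8601 date or date-time string.
const ISO_8601_PATTERN =
  /^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2})?)?$/;

function isIsoDateTime(value) {
  return (
    typeof value === 'string' &&
    ISO_8601_PATTERN.test(value) &&      // shape looks like ISO 8601
    !Number.isNaN(Date.parse(value))     // and the engine can parse it
  );
}

console.log(isIsoDateTime('2024-11-11T10:00:43+00:00')); // → true
console.log(isIsoDateTime('November 11, 2024'));         // → false
console.log(isIsoDateTime(null));                        // → false
```

A value that fails the check could be left out of the JSON-LD or flagged for manual review after the crawl.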

In the schema markup script, the datePublished value is populated using this variable:

"datePublished": datePublished,

The Word Count

To extract and calculate the number of words in the article, I use the following code:

// Select the article element

let articleElement = document.querySelector('article[data-clarity-region="article"]');

// Count the total number of words within the article element

let wordCount = articleElement ? articleElement.innerText.split(/\s+/).length : 0;

Here, I target the <article> tag with the attribute data-clarity-region="article", which uniquely identifies the article content.
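One detail of split(/\s+/) is worth knowing: leading or trailing whitespace produces empty tokens, and an empty string still yields an array of length 1, so the count can come out slightly high. A minimal sketch of a stricter count, using a hypothetical countWords function, trims the text first:

```javascript
// Hypothetical helper (an assumption, not from the original snippet):
// count whitespace-separated words without counting empty tokens.
function countWords(text) {
  if (typeof text !== 'string') {
    return 0;
  }
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

const sample = '  Expand your reach\nwith international SEO.  ';
console.log(sample.split(/\s+/).length); // → 8 (includes two empty tokens)
console.log(countWords(sample));         // → 6
console.log(countWords(''));             // → 0
```

In the snippet, this would be called as countWords(articleElement ? articleElement.innerText : '').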

The wordCount variable checks if articleElement exists. If it does, innerText extracts all visible text within the