Automating XML Sitemaps With Screaming Frog

Logan Ray

Posted 15 February, 2022 by Logan Ray in Screaming Frog SEO Spider


When Screaming Frog first rolled out the scheduled crawl feature, I was thrilled. I knew there were some great use-cases and couldn’t wait to find them. I familiarized myself with the feature by setting up a standard weekly crawl for each of my clients. Then it was on to find some more advanced uses.

This article is a guest contribution from Logan Ray of Beacon.

Around the same time, I was exploring ways for our team to gain more autonomy. The more processes we could take into our own hands, the better. One of the quickest wins was automating XML sitemaps; doing so would give us more control and efficiency. My motivators were threefold: sites on a CMS without dynamically generated XML sitemaps, no access to the server, and efficiency. I’m sure there are more I haven’t come across yet, and I’d love to hear from you on more use-cases.

Use the Screaming Frog SEO Spider to automate your XML sitemaps if:

You use an uncommon CMS platform
You want more control over the contents of your XML sitemaps
You have limited access to servers/devs

What you’ll need:

The Paid version of the Screaming Frog SEO Spider
An IT/tech team that can implement reverse proxies
A dedicated machine (this isn’t necessary but it’ll make your life a lot easier)

What are the steps for automating your XML sitemaps?

Setting up your automated crawl
Establishing a central location for storing SF output files
Creating the reverse proxy
Testing

Why Automate XML Sitemaps?

The first reason is that you use an uncommon CMS. Platforms like WordPress and Shopify offer great solutions for keeping your XML sitemap fresh. But what if the site uses a proprietary CMS without a built-in system or a huge publicly-created plugin library? The solution I’m about to explain is CMS-agnostic, meaning you can set it up for any site, regardless of what it’s built on.

The next key motivator for using the Screaming Frog SEO Spider to auto-generate your XML sitemaps is to customise them. Some platforms that can dynamically update sitemaps don’t give you much control over what’s included in them; they just dump everything in there. You may want to exclude a folder or specific set of URLs, which is far easier to do in the SEO Spider settings.

Lastly, you might run into a situation like mine where you don’t have access to the server to update these files on your own. This is where the reverse proxy comes in, which we’ll get to later. If you have access to the appropriate server folders, you can skip the reverse proxy step.

What You’ll Need

The Paid Version of the Screaming Frog SEO Spider

This goes without saying, but if you’re not paying for the Screaming Frog SEO Spider, stop reading this and do it. Now. But seriously, the scheduled crawl feature isn’t available in the free version, so this is a requirement for automating your XML sitemaps.

An IT/Tech Team That Can Implement Reverse Proxies

In my experience, IT and dev groups aren’t very open to the idea of giving SEOs access to server folders. If you’re in this boat, you’re going to need someone to implement your reverse proxies.

Optional: A Dedicated Machine

While a dedicated machine isn’t required for this automation, it’s helpful.

In order for scheduled crawls to run, the machine they’re set up on needs to be on – if you run these over the weekend and shut down your computer on Friday, no more automation. For this reason, I had our IT group set up a machine that we remote into for the initial configuration.

What Are the Steps for Automating Your XML Sitemaps?

Setting up Your Automated Crawl

The first thing you want to do is configure your scheduled crawl. If you haven’t played around with this yet, I HIGHLY recommend it. I’ve got several scheduled crawls set up, most of which are for XMLs, but I have a few other purpose-built crawls that run regularly.

As for cadence, I run mine weekly, but you can set this up for any interval you want. Depending on how often content is added to your site, you may want to toggle this up or down.

You’ll also need to assess whether or not you need a custom-built crawl settings file. This will mostly depend on whether or not you want to customise the contents of your sitemap. Most of my clients need this. In some cases, it’s because we have a sitemap index and therefore a different settings file for each of the segmented XMLs. In other cases, there are some customisations I wanted to bake in.

Setting up a scheduled crawl (File > Scheduling) is simple: give it a name, and set your frequency and timing. I recommend using the description as a place to note frequency. This is helpful when you’ve got several to set up, because the list of scheduled crawls shows the description but not the date/time.

Running in headless mode is required for exports, so be sure to check that box. You’ll also want to overwrite files in output so your filename doesn’t change. In order for the reverse proxy to work, you need a consistent file path. And of course, save the crawl and export the XML sitemap.

If you’re setting up a sitemap index with nested sitemaps within, you’ll need to set up individual crawls using includes and excludes to segment them the way you want.
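To make the nesting concrete, here is a minimal sketch of a sitemap index file as defined by the sitemaps.org protocol. The two child file names and the example.com URLs are hypothetical placeholders; in this setup, each child would be the overwritten output of its own scheduled crawl.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Hypothetical child sitemap produced by a crawl limited to /products/ -->
  <sitemap>
    <loc>https://www.example.com/sitemap-products.xml</loc>
  </sitemap>
  <!-- Hypothetical child sitemap produced by a crawl limited to /blog/ -->
  <sitemap>
    <loc>https://www.example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
```

Each child sitemap needs its own consistent file path, just as a single sitemap would.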

One final thing to note with regard to your crawl settings: go to the Sitemap Export Configuration and choose your settings there before you save the crawl settings file. This ensures the export format is what you want. Otherwise, it includes things like change frequency and priority by default.
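For reference, the sitemaps.org protocol treats lastmod, changefreq, and priority as optional tags. The sketch below is hand-written for illustration (it is not actual SEO Spider output, and the URLs and values are placeholders). It shows an entry with the optional tags next to a bare one.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Entry with the optional tags included -->
  <url>
    <loc>https://www.example.com/page-a/</loc>
    <lastmod>2022-02-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <!-- Entry with only the required loc tag -->
  <url>
    <loc>https://www.example.com/page-b/</loc>
  </url>
</urlset>
```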

Establishing a Central Location for Storing Screaming Frog Output Files

In order for the reverse proxy to work, make sure your scheduled crawl dumps the files in a specific location and, as mentioned above, turn on the ‘overwrite files’ option rather than date-stamping your files. This server location will also need to be accessible via the web. So if your file path on the server is Z:\client-name\sitemaps\sitemap.xml, it should also render at example.com/client-name/sitemaps/sitemap.xml.

Creating the Reverse Proxy

The reverse proxy is the conduit between your Screaming Frog SEO Spider file creation and your website. I won’t get into the details of a reverse proxy, as plenty of more qualified people have written about that. Essentially, you’re rerouting a request for /sitemap.xml to a different location. The URL stays the same, but the rendered content comes from the alternative file you’re dropping with the crawl rather than from the server’s root folder.

Here’s what a reverse proxy looks like in web.config. If you need to set up your reverse proxies in .htaccess, they’ll look a bit different.
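As a rough sketch of such a rule, assuming an IIS site with the URL Rewrite module installed: the rule name is made up, and the target URL reuses the example storage location from the previous section, so both are placeholders to swap for your own. If the target lives on a different host, IIS typically also needs Application Request Routing with proxying enabled, which is something to confirm with your IT team.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <!-- Hypothetical rule: serve /sitemap.xml from the crawl output location -->
        <rule name="SitemapReverseProxy" stopProcessing="true">
          <!-- The match pattern is relative to the site root, with no leading slash -->
          <match url="^sitemap\.xml$" />
          <!-- Placeholder target: the web-accessible path of the overwritten export -->
          <action type="Rewrite" url="https://example.com/client-name/sitemaps/sitemap.xml" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>
```

Because the action is a rewrite rather than a redirect, visitors and crawlers still see /sitemap.xml in the address bar.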

BONUS:
While you’re at it, drop a robots.txt file in the same folder you’re storing your sitemaps in and have your IT team reverse proxy that too. That’s one more box ticked off the mission-autonomy list: no more waiting on developers to drop a new robots file.
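A robots.txt rule could look much the same. The fragment below is a sketch under the same assumptions (IIS with the URL Rewrite module, a hypothetical rule name, and a placeholder target URL); it would sit inside the rules element of the rewrite section in web.config.

```xml
<!-- Hypothetical rule: serve /robots.txt from the same folder as the sitemaps -->
<rule name="RobotsReverseProxy" stopProcessing="true">
  <match url="^robots\.txt$" />
  <!-- Placeholder target: adjust to wherever the file is actually stored -->
  <action type="Rewrite" url="https://example.com/client-name/sitemaps/robots.txt" />
</rule>
```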

Testing

Since you’re affecting how the production site functions, you’re going to want to do testing here.

I set up the reverse proxy in a staging environment first, but if you don’t have access to one, I’d recommend syncing up with the devs so you can test right away and have the rules removed immediately if you encounter problems. To confirm the proxy is serving the right file, I always open the SEO Spider generated file and make a minor change, then refresh the XML sitemap on the site and check that the change shows up.

Wrapping It Up

All in all, I’d say this project takes no more than 2 hours to set up on a new site – but account for more if you’ve got a sitemap index.

We’ve been running this for about 2 years, and it has increased the frequency with which our XMLs are refreshed and all but removed the time spent creating them. I do spend about 15 minutes doing a quick review of them on Monday morning after the update.

The time investment I put into the initial build was well worth it. My team no longer has to create these manually, which means one less monotonous task and more time for them to focus on deeper analysis and more rewarding projects.

Logan Ray

Logan got into SEO back in 2011, back when we had organic keyword data in GA! He started with small clients and worked up to more complex websites with fun problems to solve. As the Director of SEO and Analytics at Beacon in North Carolina, he has a knack for finding creative solutions to technical problems.

11 Comments

Raman Sharma 3 years ago

Could you please help me to know how can I setup XML sitemap for my website? This website has been developed in WordPress CMS.

- screamingfrog 3 years ago
  
  You can just use the SEO plugin from Yoast, which has an auto XML Sitemap –
  
  https://yoast.com/wordpress/plugins/seo/
  
  Cheers.
  
  Dan
  
- Marston Gould 2 years ago
  
  Yoast hasn’t really been updated or changed in years and is a black box.
  This allows for further customization.
  
Barnaba Mądrecki 3 years ago

Big feature! Congrats! It would be very useful to see a “how-to” tutorial on YouTube. I believe many users will need some help with auto upload to the server.
Once again, thanks for this feature!

Aaron Taylor 3 years ago

It’s pretty interesting just watching a Sitemap generation tool crawl your site and count the number of pages, along with their sizes.

Mariusz Kubiak 3 years ago

It’s a beautiful thing. Again, another handy feature.

Sergei 3 years ago

Great solution for non-wordpress websites, it helped me to save a lot of time. Thanks a lot for making SF better and more useful all the time!

- adi alon 2 years ago
  
  I really agree.. Just a wonderful explanation without a doubt and helps me a lot
  
Shawn 3 years ago

This is very helpful! What if you have a very large site with well over the 50k limit for a sitemap? Is there a way to have it break up the sitemap files into smaller files?

- Logan Ray 3 years ago
  
  We set this up for several clients that had nested sitemaps. Just need to set multiple scheduled crawls for however you want to segment them and map the reverse proxy accordingly.
  
myspeed 3 years ago

Great solution for non-wordpress websites, it helped me to save a lot of time
