Automating XML Sitemaps With Screaming Frog
When Screaming Frog first rolled out the scheduled crawl feature, I was thrilled. I knew there were some great use-cases and couldn’t wait to find them. I familiarized myself with the feature by setting up a standard weekly crawl for each of my clients. Then it was on to find some more advanced uses.
Around the same time, I was exploring ways for our team to gain more autonomy. The more processes we could take into our own hands, the better. One of the quickest wins was automating XML sitemaps; doing so would give us more control and efficiency. My motivators were threefold: sites on a CMS without dynamically generated XML sitemaps, no access to the server, and efficiency. I’m sure there are more I haven’t come across yet, and I’d love to hear from you on more use-cases.
Use the Screaming Frog SEO Spider to automate your XML sitemaps if:
- You use an uncommon CMS platform
- You want more control over the contents of your XML sitemaps
- You have limited access to servers/devs
What you’ll need:
- The Paid version of the Screaming Frog SEO Spider
- An IT/tech team that can implement reverse proxies
- A dedicated machine (this isn’t necessary but it’ll make your life a lot easier)
What are the steps for automating your XML sitemaps?
- Setting up your automated crawl
- Establishing a central location for storing SF output files
- Creating the reverse proxy
Why Automate XML Sitemaps?
The first reason is because you use an uncommon CMS. Platforms like WordPress and Shopify offer great solutions to keeping your XML sitemap fresh. But what if the site uses a proprietary CMS without a built-in system or a huge publicly-created plugin library? The solution I’m about to explain is CMS-agnostic, meaning you can set it up for any site, regardless of what it’s built on.
The next key motivator for using the Screaming Frog SEO Spider to auto-generate your XML sitemaps is to customise them. Some platforms that can dynamically update sitemaps don’t give you much control over what’s included; they just dump everything in there. You may want to exclude a folder or a specific set of URLs, which is far easier to do in the SEO Spider settings.
Lastly, you might run into a situation like mine, where you don’t have access to the server to update these files on your own. This is where the reverse proxy comes in; we’ll get to that later. If you have access to the appropriate server folders, you can skip the reverse proxy step.
What You’ll Need
The Paid Version of the Screaming Frog SEO Spider
This goes without saying, but if you’re not paying for the Screaming Frog SEO Spider, stop reading this and do it. Now. But seriously, the scheduled crawl feature isn’t available in the free version, so this is a requirement for automating your XML sitemaps.
An IT/Tech Team That Can Implement Reverse Proxies
In my experience, IT and dev groups aren’t very open to the idea of giving SEOs access to server folders. If you’re in this boat, you’re going to need someone to implement your reverse proxies.
Optional: A Dedicated Machine
While a dedicated machine isn’t required for this automation, it’s helpful.
In order for scheduled crawls to run, the machine they’re set up on needs to be on – if you run these over the weekend and shut down your computer on Friday, no more automation. For this reason, I had our IT group set up a machine that we remote into for the initial configuration.
What Are the Steps for Automating Your XML Sitemaps?
Setting up Your Automated Crawl
The first thing you want to do is configure your scheduled crawl. If you haven’t played around with this yet, I HIGHLY recommend it. I’ve got several scheduled crawls set up, most of which are for XMLs, but I have a few other purpose-built crawls that run regularly.
As for cadence, I run mine weekly, but you can set this up for any interval you want. Depending on how often content is added to your site, you may want to toggle this up or down.
You’ll also need to assess whether you need a custom-built crawl settings file. This will mostly depend on whether you want to customise the contents of your sitemap. Most of my clients need this. In some cases, it’s because we have a sitemap index and therefore a different settings file for each of the segmented XMLs. In other cases, there are customisations I wanted to bake in.
Setting up a scheduled crawl (File > Scheduling) is simple: Give it a name, and set your frequency and timing. I recommend using the description as a place to note frequency. This is helpful when you’ve got several to set up – the description shows up in the list but not the date/time.
Running in headless mode is required for exports, so be sure to check that box. You’ll also want to overwrite files in output so your filename doesn’t change. In order for the reverse proxy to work, you need a consistent file path. And of course, save the crawl and export the XML sitemap.
If you’re setting up a sitemap index with nested sitemaps within, you’ll need to set up individual crawls using includes and excludes to segment them the way you want.
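For example, the SEO Spider’s Include and Exclude settings (under Configuration) take one regex per line. A hypothetical split for a blog sub-sitemap might look like this (example.com and the /blog/ path are placeholders):

```text
# Include (crawl that generates blog-sitemap.xml):
https://example.com/blog/.*

# Exclude (crawl that generates the main sitemap.xml):
https://example.com/blog/.*
```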
One final thing to note with regard to your crawl settings: go to the Sitemap Export Configuration and choose your settings there before you save the crawl settings file. This will ensure the export format is what you want; otherwise, it includes things like change frequency and priority by default.
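For reference, the scheduling UI is essentially driving the SEO Spider’s command-line interface, and you can run the same crawl headlessly yourself. A sketch of an equivalent invocation on Windows (the domain, paths, and config filename are placeholders; check the flags against the CLI documentation for your installed version):

```shell
"C:\Program Files (x86)\Screaming Frog SEO Spider\ScreamingFrogSEOSpiderCli.exe" ^
  --crawl https://example.com ^
  --headless ^
  --config "Z:\client-name\configs\sitemap.seospiderconfig" ^
  --output-folder "Z:\client-name\sitemaps" ^
  --overwrite ^
  --create-sitemap
```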
Establishing a Central Location for Storing Screaming Frog Output Files
In order for the reverse proxy to work, make sure you have your scheduled crawl dump the files in a specific location and, as mentioned above, turn on the ‘overwrite files’ option rather than date-stamping your files. This server location will also need to be accessible via the web. So if your file path on the server is Z:\client-name\sitemaps\sitemap.xml, it should also render at example.com/client-name/sitemaps/sitemap.xml.
Creating the Reverse Proxy
The reverse proxy is the conduit between your Screaming Frog SEO Spider file creation and your website. I won’t get into the details of how a reverse proxy works; plenty of more qualified people have written about that. Essentially, you’re rerouting a request for /sitemap.xml to a different location, so the URL stays the same, but the rendered content doesn’t come from the server’s root folder; it comes from the alternative file you’re dropping with the crawl.
Here’s what a reverse proxy looks like in web.config; if you need to set up your reverse proxies in .htaccess, they’ll look a bit different.
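A minimal sketch of such a rule, assuming the IIS URL Rewrite module is installed (with Application Request Routing enabled if the target lives on another host); the rule name, host, and paths are placeholders:

```xml
<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <!-- Proxy /sitemap.xml to the web-accessible copy of the SEO Spider export -->
        <rule name="sitemap-proxy" stopProcessing="true">
          <match url="^sitemap\.xml$" />
          <action type="Rewrite" url="https://files.example.com/client-name/sitemaps/sitemap.xml" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>
```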
While you’re at it, drop a robots.txt file in the same folder you’re storing your sitemaps in and have them reverse proxy that too. One more box ticked off the mission-autonomy list. No more waiting on developers to drop a new robots file.
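On Apache, the equivalent .htaccess rules might look like this; a sketch assuming mod_rewrite and mod_proxy are enabled, with the host and paths as placeholders:

```apache
RewriteEngine On
# Proxy sitemap and robots requests to the folder holding the SEO Spider exports
RewriteRule ^sitemap\.xml$ https://files.example.com/client-name/sitemaps/sitemap.xml [P,L]
RewriteRule ^robots\.txt$  https://files.example.com/client-name/sitemaps/robots.txt [P,L]
```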
Since you’re affecting how the production site functions, you’re going to want to do testing here.
I set up the reverse proxy in a staging environment first, but if you don’t have access to one, I’d recommend syncing up with the devs so you can test right away and have the proxies removed immediately if you encounter problems. I always open the SEO Spider-generated file and make a minor change, then refresh the XML sitemap on the site to confirm the change comes through.
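It can also help to sanity-check the exported file itself before the proxy goes live. This isn’t part of the Screaming Frog workflow, just a small standalone sketch: it parses a sitemap, verifies the root element, and counts the URL entries (the sample sitemap below is made up for illustration).

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def check_sitemap(xml_text: str) -> int:
    """Parse a sitemap and return the number of <loc> entries.

    Raises ET.ParseError if the file is not well-formed XML, and
    ValueError if the root element or any <loc> value looks wrong.
    """
    root = ET.fromstring(xml_text)
    if root.tag != f"{SITEMAP_NS}urlset":
        raise ValueError(f"unexpected root element: {root.tag}")
    locs = [loc.text for loc in root.iter(f"{SITEMAP_NS}loc")]
    if not all(loc and loc.startswith("http") for loc in locs):
        raise ValueError("sitemap contains empty or non-HTTP <loc> values")
    return len(locs)

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

print(check_sitemap(sample))  # → 2
```

In practice you’d read the file the scheduled crawl drops (or fetch the proxied URL) instead of the inline sample.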
Wrapping It Up
All in all, I’d say this project takes no more than 2 hours to set up on a new site – but account for more if you’ve got a sitemap index.
We’ve been running this for about two years. It has increased the frequency with which our XMLs are refreshed, and it has all but removed the time spent creating them. I do spend about 15 minutes doing a quick review of them on Monday mornings after the update.
The time investment I put into the initial build was well worth it. My team no longer has to create these manually, which means one less monotonous task and more time to focus on deeper analysis and more rewarding projects.