How To Create An XML Sitemap Using The SEO Spider
This tutorial walks you through how to use the Screaming Frog SEO Spider to generate XML Sitemaps. To get started, you’ll need to download the SEO Spider, which is free in lite form for up to 500 URLs. You can download it via the buttons in the right-hand sidebar.
If you’d like to crawl more than 500 URLs, you can buy an annual licence, which removes the crawl limit and opens up the configuration options.
The steps to create an XML Sitemap are as follows:
1) Crawl The Website
Open up the SEO Spider, type or copy in the website you wish to crawl in the ‘enter url to spider’ box and hit ‘Start’.
2) Click ‘Sitemaps > Create XML Sitemap’
When the crawl has reached 100% and finished, click the ‘Create XML Sitemap’ option under ‘Sitemaps’ in the top level menu.
This will open up a number of sitemap configuration options.
3) Select ‘Pages’ To Include
By default, only HTML pages in the ‘internal’ tab with a ‘200’ OK response from the crawl will be included in the XML sitemap. So you don’t need to worry about redirects (3XX), client side errors (4XX, like broken links) or server errors (5XX) being included in the sitemap.
Pages which are blocked by robots.txt, set as ‘noindex’, ‘canonicalised’ (the canonical URL is different to the URL of the page) or paginated (URLs with a rel=“prev”) are also excluded as standard, as are PDFs. This can all be adjusted within the XML Sitemap ‘pages’ configuration, so simply select your preference.
You can see which URLs are ‘noindex’, ‘canonicalised’ or have a rel=“prev” link element on them under the ‘Directives’ tab and using the respective filters.
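The default inclusion rules described above can be sketched as a simple filter. This is only an illustration of the logic, not the SEO Spider's own code: the `CrawledUrl` record and its field names are hypothetical and do not correspond to the tool's export columns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawledUrl:
    # Hypothetical record of one crawled URL; field names are illustrative only.
    address: str
    status_code: int
    content_type: str
    blocked_by_robots: bool = False
    noindex: bool = False
    canonical: Optional[str] = None  # canonical link element, if any
    has_rel_prev: bool = False       # paginated page with a rel="prev" link


def included_by_default(url: CrawledUrl) -> bool:
    """Return True if the URL passes the default rules described in step 3."""
    if url.status_code != 200:
        return False  # drops 3XX, 4XX and 5XX responses
    if not url.content_type.startswith("text/html"):
        return False  # drops PDFs and other non-HTML content
    if url.blocked_by_robots or url.noindex or url.has_rel_prev:
        return False
    if url.canonical is not None and url.canonical != url.address:
        return False  # canonicalised to a different URL
    return True
```

Each check corresponds to one of the options in the ‘pages’ configuration, so relaxing an option is equivalent to removing the matching condition.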
4) Exclude Pages From The XML Sitemap
Outside of the above configuration options, there might be additional ‘internal’ HTML 200 response pages that you simply don’t want to include within the XML Sitemap.
For example, you shouldn’t include ‘duplicate’ pages within a sitemap. If a page can be reached by two different URLs, for example http://example.com and http://www.example.com (and they both resolve with a ‘200’ response), then only a single preferred canonical version should be included in the sitemap. You also shouldn’t include URLs with session IDs (you can use the URL rewriting feature to strip these during a crawl). There might also be URLs with lots of unnecessary parameters, or whole sections of a website that don’t need to be listed.
There are a few ways to make sure they are not included within the XML Sitemap:
- If there are sections of the website or URL paths that you don’t want to include in the XML Sitemap, you can simply exclude them in the configuration pre-crawl. As they won’t be crawled, they won’t be included within the ‘internal’ tab or the XML Sitemap.
- If you have already crawled URLs which you don’t want included in the XML Sitemap export, then simply highlight them in the ‘internal’ tab in the top window pane, right click and ‘remove’ them, before creating the XML sitemap.
- Alternatively you can export the ‘internal’ tab to Excel, filter and delete any URLs that are not required and re-upload the file in list mode, before generating the XML sitemap.
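If you take the export route, the session ID clean-up mentioned above can be scripted before re-uploading the list. The following is a minimal sketch, assuming the session ID is carried in a query parameter; the parameter names in `SESSION_PARAMS` are assumptions, so adjust them to whatever your site actually uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; replace with the ones your site uses.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}


def strip_session_params(url):
    """Remove session ID query parameters, keeping all other parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def dedupe(urls):
    """Strip session parameters and drop URLs that become duplicates."""
    seen = set()
    result = []
    for url in urls:
        cleaned = strip_session_params(url)
        if cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result
```

This only handles session IDs in the query string. Choosing between variants such as http://example.com and http://www.example.com is still a decision about your preferred canonical version.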
5) Choose The Last Modified Date
This is a completely optional attribute to include within an XML Sitemap, so you can ‘untick’ the ‘include the lastmod tag’ box if you don’t want to include the date of the last modification of the file. It’s just a hint to the search engines when the page was last updated.
If you wish to include the ‘lastmod’, then simply select whether you’d like to use the ‘last modified’ response provided directly from your server (and seen within the ‘Last Modified’ column in the ‘Internal’ tab) or use a custom date.
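To illustrate what the server-supplied option involves, here is a small sketch that converts an HTTP ‘Last-Modified’ header value into the date format a sitemap ‘lastmod’ tag accepts. It uses only the Python standard library; the function name is hypothetical, and the SEO Spider may well format the value differently (for example with a full timestamp).

```python
from email.utils import parsedate_to_datetime


def lastmod_from_header(last_modified):
    """Convert an HTTP Last-Modified header value to a YYYY-MM-DD date string."""
    parsed = parsedate_to_datetime(last_modified)
    return parsed.date().isoformat()
```

For instance, `lastmod_from_header("Wed, 21 Oct 2015 07:28:00 GMT")` returns `"2015-10-21"`.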
6) Select The ‘Priority’ of URLs
‘Priority’ is an optional attribute to include in an XML Sitemap. You can ‘untick’ the ‘include priority tag’ box, if you don’t want to set the priority of URLs. The priority provides a hint to the search engines of the importance of a URL, relative to other URLs on your site. Valid values range from 0.0 up to the highest priority of 1.0, with the default at 0.5.
The SEO Spider allows you to configure these based upon ‘level’ (the depth) of the URLs. You can view the ‘level’ of URLs under the ‘level’ column in the ‘Internal’ tab.
As shown in the screenshot above, by default the homepage (or start page of the crawl) is set to the highest priority of ‘1’, descending by 0.1 in priority by each level of depth down to 0.5 for level 5+. These can be adjusted to your own preference.
Please remember, the ‘priority’ of URLs will not influence how they are scored within the search engines. The ‘priority’ is used to increase the likelihood of the most important pages being crawled and indexed. In reality, Google do a very good job of working this out algorithmically.
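The default mapping from ‘level’ to priority described above can be written as a one-line rule. This sketch assumes the homepage (or start page) is level 0; the function name is made up for illustration.

```python
def default_priority(level):
    """Priority of 1.0 at level 0, dropping by 0.1 per level, floored at 0.5."""
    return max(0.5, round(1.0 - 0.1 * level, 1))
```

Under that assumption a level 2 page gets 0.8, and anything at level 5 or deeper gets 0.5.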
7) Select The ‘Change Frequency’ of URLs
The ‘changefreq’ is another optional attribute which ‘hints’ at how frequently the page is likely to change.
The SEO Spider allows you to configure these based on the ‘last modification’ response or ‘level’ (depth) of URLs. The ‘calculate from last modified header’ option means that if the page has been changed in the last 24 hours, it will be set to ‘daily’, if not, it’s set as ‘monthly’.
Please do remember, these are not commands to the search engines, merely ‘hints’. Google will essentially crawl a URL as frequently as it determines algorithmically, over any ‘hint’ provided by you in the XML sitemap.
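The ‘calculate from last modified header’ rule amounts to a single comparison, sketched below. The function name and arguments are hypothetical, and treating a page modified exactly 24 hours ago as ‘monthly’ is an assumption about the boundary.

```python
from datetime import datetime, timedelta, timezone


def changefreq_from_last_modified(last_modified, now):
    """Return 'daily' if the page changed within the last 24 hours, else 'monthly'."""
    if now - last_modified < timedelta(hours=24):
        return "daily"
    return "monthly"


# Example: a page modified three hours before the (fixed) reference time.
reference = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
recent = changefreq_from_last_modified(reference - timedelta(hours=3), reference)
```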
8) Select Images To Include In The Sitemap
It’s entirely optional whether to include images in an XML sitemap. If the ‘include images’ option is ticked, then all images under the ‘Internal’ tab (and the ‘Images’ tab) will be included by default. If your images are on a CDN or subdomain, or reside externally, they will appear under the ‘external’ tab within the UI. You can input regex into the configuration to include these within the XML Sitemap.
Usually you don’t really need to include images such as your own logo, spacers, or social media profile icons within the XML Sitemap, so you can select to only include images with a certain number of source attribute references to exclude these.
Often images like logos are linked to sitewide, while images on product pages, which you usually want to include, might only be linked to once or twice. There is an ‘IMG Inlinks’ column in the ‘images’ tab which shows how many times an image is referenced, to help you choose a suitable number of ‘inlinks’ for inclusion.
You can also right-click and ‘remove’ any images you don’t want to include, in the same way as any other URL.
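The inlinks-based filtering can be pictured as a simple threshold. This sketch assumes the setting acts as an upper bound (images referenced more often than the limit are treated as sitewide furniture and dropped); the function name and the example data are invented for illustration.

```python
def images_to_include(img_inlinks, max_inlinks):
    """Keep images referenced at most max_inlinks times.

    img_inlinks maps an image URL to its 'IMG Inlinks' count.
    """
    return [url for url, count in img_inlinks.items() if count <= max_inlinks]


# Hypothetical counts: the logo appears on every page, product shots only once or twice.
example_counts = {
    "https://example.com/logo.png": 480,
    "https://example.com/products/widget.jpg": 2,
    "https://example.com/products/gadget.jpg": 1,
}
kept = images_to_include(example_counts, max_inlinks=5)
```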
9) Click ‘Next’ To Generate The XML Sitemap
When you have finished configuring the various sitemap attributes and options, you can simply click ‘next’ to create the XML Sitemap. A sitemap file can’t contain more than 50,000 URLs and must be no larger than 50 MB uncompressed. Hence, if you have over 49,999 URLs, the SEO Spider will automatically create additional sitemap files and a sitemap index file referencing the sitemap locations.
Then click ‘save’ to your preferred location on your machine. While that’s all the steps required to create the XML sitemap, there are a couple more steps we recommend afterwards!
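To show what the generated files contain, here is a rough sketch of splitting a URL list into sitemap files of at most 50,000 URLs plus a sitemap index. It follows the sitemaps.org format but is not the SEO Spider's implementation: the function names are hypothetical, only the ‘loc’ tag is written, and the 50 MB size limit is not checked.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000


def chunk(urls, size=MAX_URLS_PER_FILE):
    """Split the URL list into consecutive groups of at most `size` URLs."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_urlset(urls):
    """Build the XML for one sitemap file containing only <loc> entries."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


def build_index(sitemap_urls):
    """Build the XML for a sitemap index that references each sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = sitemap_url
    return ET.tostring(index, encoding="unicode")
```

A list of 50,001 URLs would produce two groups from `chunk`, one of 50,000 URLs and one with the single remaining URL, each passed to `build_urlset` and then listed in the index.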
Submit Your XML Sitemap To Google
The XML sitemap is now ready to submit to the search engines. We highly recommend submitting the XML Sitemap to Google via Webmaster Tools as a way of tracking indexation.
Insert A Sitemap Entry Into Your Robots.txt File
Finally, we recommend adding a ‘Sitemap:’ line, followed by the full URL of your XML Sitemap, anywhere within your robots.txt file. This informs the search engines of the XML Sitemap’s existence (regardless of whether you have already submitted it to Google Webmaster Tools).
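As an illustration, the sketch below appends such an entry to the contents of a robots.txt file if it is not already there. The helper name and the example sitemap URL are hypothetical; substitute the real location of your sitemap.

```python
def add_sitemap_entry(robots_txt, sitemap_url):
    """Return robots.txt content with a 'Sitemap:' line for sitemap_url."""
    line = f"Sitemap: {sitemap_url}"
    if line in robots_txt.splitlines():
        return robots_txt  # entry already present, nothing to do
    separator = "" if robots_txt == "" or robots_txt.endswith("\n") else "\n"
    return f"{robots_txt}{separator}{line}\n"


# Hypothetical robots.txt content and sitemap location.
updated = add_sitemap_entry(
    "User-agent: *\nDisallow:\n",
    "https://www.example.com/sitemap.xml",
)
```

Here `updated` ends with the line `Sitemap: https://www.example.com/sitemap.xml`, and calling the function again with the same URL leaves the content unchanged.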
That’s it! Hopefully the above guide helps illustrate how to use the SEO Spider software to generate a Google XML Sitemap for your website.