An SEO’s Guide To Crawling HSTS
HTTP Strict Transport Security
HTTP Strict Transport Security (HSTS) is a standard, defined in RFC 6797, by which a web server can declare to a client that it should only be accessed via HTTPS.
The client, typically a web browser or crawler, will then make all future requests over HTTPS, even when following a link to an HTTP URL. When this happens, the SEO Spider (as of version 8.0) shows a Status Code of 307, a Status of “HSTS Policy” and a Redirect Type of “HSTS Policy”.
Here’s the SEO Spider running on our HSTS test site https://www.screamingprojects.com/hsts/.
Here’s how Chrome handles the same situation (I’ve ticked the ‘Preserve Log’ option at the top, otherwise the 307 is lost).
Unlike a 301 or a 302, this redirect isn’t actually sent by the web server. It’s just an internal representation in the browser and the SEO Spider. No request is actually sent to the web server; it’s turned around internally.
When a web server declares it should only be contacted via HTTPS, it sets an expiry on that declaration, so representing the rewrite as a 307 makes sense, as 307 means ‘Temporary Redirect’.
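As a rough illustration of the internal turnaround described above, here is a minimal Python sketch of a client-side HSTS record with an expiry. The names HstsStore and internal_redirect are hypothetical and are not taken from any browser or from the SEO Spider; the sketch also ignores subdomains and explicit port numbers.

```python
import time
from urllib.parse import urlsplit


class HstsStore:
    """Hypothetical in-memory record of hosts that have declared an HSTS policy."""

    def __init__(self):
        self._expiry = {}  # host -> absolute expiry time (seconds since epoch)

    def remember(self, host, max_age, now=None):
        """Record a policy lasting max_age seconds; max_age of 0 removes it."""
        now = time.time() if now is None else now
        host = host.lower()
        if max_age == 0:
            self._expiry.pop(host, None)
        else:
            self._expiry[host] = now + max_age

    def is_known(self, host, now=None):
        """True while the recorded policy for host has not yet expired."""
        now = time.time() if now is None else now
        expiry = self._expiry.get(host.lower())
        return expiry is not None and now < expiry


def internal_redirect(url, store, now=None):
    """Return (307, https_url) if an HSTS policy applies, else None.

    Nothing is sent over the network: the URL is rewritten locally.
    """
    parts = urlsplit(url)
    if parts.scheme != "http" or not parts.hostname:
        return None
    if not store.is_known(parts.hostname, now):
        return None
    return 307, parts._replace(scheme="https").geturl()
```

Once the recorded expiry has passed, internal_redirect returns None again and the HTTP request would go out to the server as normal, which is why a temporary status is a reasonable way to represent it.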
The HSTS protocol is based on the server sending a single response header called Strict-Transport-Security, which must only be sent over HTTPS. If it is received over HTTP, the client ignores it. There are two directives associated with the header:
- max-age: This is mandatory and specifies the number of seconds, counted from when the header is received, for which the server must only be contacted via HTTPS.
- includeSubDomains: This is an optional directive. If set, it specifies that the HSTS policy should also apply to any subdomains.
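To make the two directives concrete, here is a minimal Python sketch of how a client might read them from a header value. The function name parse_sts is hypothetical, and the parsing is simplified compared with the full grammar in RFC 6797 (for example, it does not reject repeated directives).

```python
def parse_sts(value):
    """Parse a Strict-Transport-Security header value.

    Returns (max_age, include_subdomains), or None if the mandatory
    max-age directive is missing or is not a non-negative integer.
    """
    max_age = None
    include_subdomains = False
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        name = name.strip().lower()   # directive names are case-insensitive
        arg = arg.strip().strip('"')  # the value may be a quoted string
        if name == "max-age":
            if not (arg.isascii() and arg.isdigit()):
                return None
            max_age = int(arg)
        elif name == "includesubdomains":
            include_subdomains = True
        # Any other directive is ignored in this sketch.
    if max_age is None:
        return None
    return max_age, include_subdomains
```

For instance, parse_sts("max-age=2592000 ; includeSubDomains") gives (2592000, True), while a value with no max-age gives None because the mandatory directive is absent.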
Enables HSTS for a year:
Strict-Transport-Security: max-age=31536000
Forces expiry of HSTS Policy:
Strict-Transport-Security: max-age=0
Enables HSTS policy for a month for this domain, and all subdomains:
Strict-Transport-Security: max-age=2592000; includeSubDomains
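Since max-age is expressed in seconds, it can help to derive the value from a duration rather than typing it by hand. The following Python sketch uses a hypothetical helper, sts_header, and assumes a year of 365 days and a month of 30 days.

```python
from datetime import timedelta


def sts_header(duration, include_subdomains=False):
    """Build a Strict-Transport-Security header line from a timedelta."""
    seconds = int(duration.total_seconds())  # max-age is a whole number of seconds
    value = f"max-age={seconds}"
    if include_subdomains:
        value += "; includeSubDomains"
    return f"Strict-Transport-Security: {value}"


print(sts_header(timedelta(days=365)))       # a year
print(sts_header(timedelta(0)))              # forces expiry
print(sts_header(timedelta(days=30), True))  # a month, including subdomains
```

A 30 day month works out to 30 × 86,400 = 2,592,000 seconds, and a 365 day year to 31,536,000 seconds.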
As the HTTP to HTTPS rewriting happens internally on the client, there are several key benefits to this over just using a site-wide HTTP to HTTPS redirect:
- Reduced communication over non-secure protocols.
- Improved performance, as a round trip is avoided each time an HTTP link is encountered.
- Reduced load on the web server.
A site-wide HTTP to HTTPS redirect is still needed, however, as the Strict-Transport-Security header is ignored unless it’s sent over HTTPS. So if the first visit to your site is not via HTTPS, you still need that initial redirect to HTTPS to deliver the Strict-Transport-Security header.
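The server-side combination described here can be sketched in a few lines of Python. The respond function is a hypothetical stand-in for real server configuration, and the one year max-age is just an example value.

```python
def respond(scheme, host, path):
    """Return (status, headers) for a request to a hypothetical HSTS-enabled site."""
    if scheme == "http":
        # Plain HTTP: redirect to HTTPS and do not send the HSTS header,
        # since clients ignore it over HTTP anyway.
        return 301, {"Location": f"https://{host}{path}"}
    # HTTPS: serve the page and declare the policy.
    return 200, {"Strict-Transport-Security": "max-age=31536000"}
```

A first-time visitor arriving over HTTP gets the 301, follows it, and only then receives the policy on the HTTPS response.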
Given the above, you may expect to never see a 307 as the first response shown in the SEO Spider, but this can happen. This is because, behind the scenes, the SEO Spider first makes an HTTP request for the robots.txt file, receives a 301 to the HTTPS version of the site, follows it, and receives the Strict-Transport-Security header in the HTTPS response. It will then report a 307 for the first URL crawled. If you disable robots.txt checking, the SEO Spider will report a 301.
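The ordering effect can be reproduced with a small Python simulation. This is only a sketch under stated assumptions: fake_fetch stands in for a hypothetical site that redirects all HTTP to HTTPS and sends the HSTS header over HTTPS, and first_reported_status is an illustrative name, not the SEO Spider’s actual implementation.

```python
from urllib.parse import urlsplit


def fake_fetch(url):
    """Stand-in for a network request to a hypothetical HSTS-enabled site."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        return 301, {"Location": f"https://{parts.netloc}{parts.path}"}
    return 200, {"Strict-Transport-Security": "max-age=31536000"}


def first_reported_status(start_url, check_robots=True):
    """Return the status a simplified crawler would report for start_url."""
    known_hsts_hosts = set()

    def request(url):
        parts = urlsplit(url)
        if parts.scheme == "http" and parts.hostname in known_hsts_hosts:
            return 307  # internal HSTS rewrite, nothing is sent to the server
        status, headers = fake_fetch(url)
        if parts.scheme == "https" and "Strict-Transport-Security" in headers:
            known_hsts_hosts.add(parts.hostname)
        if status == 301:
            request(headers["Location"])  # follow the redirect to its target
        return status

    if check_robots:
        parts = urlsplit(start_url)
        request(f"{parts.scheme}://{parts.netloc}/robots.txt")
    return request(start_url)


print(first_reported_status("http://www.example.com/"))
print(first_reported_status("http://www.example.com/", check_robots=False))
```

With the robots.txt check, the policy is already known by the time the start URL is requested, so the first call reports 307; without it, the second call reports the server’s own 301.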
Enabling and Disabling HSTS
By default, the SEO Spider does not respect HSTS policy, but this can be enabled by ticking the ‘Respect HSTS Policy’ configuration under ‘Configuration > Spider > Advanced’ in the SEO Spider.
This means that, unless this configuration is enabled, the SEO Spider will ignore HSTS completely and report on the underlying redirects and status codes.