I have had a couple of cases (and queries) involving secure pages (https://) and duplicate content recently, so I thought it would be a useful area to discuss.


Hypertext transfer protocol secure (https) pages are often used for payment transactions, logins and shopping baskets to provide an encrypted and secure connection. Secure pages can of course be crawled and indexed by the search engines just like regular pages. Although it might be hard to spot the difference between the http and https version of a page, they are technically different URIs (the difference might only be a single ‘s’!) and they will be treated as separate pages by the search engines.

So as an example, the two URIs below would be seen as different pages –

http://www.screamingfrog.co.uk/

https://www.screamingfrog.co.uk/
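To labour the point with a minimal Python sketch (purely illustrative), the two addresses share a host and a path but differ in scheme, so they are not the same URI:

```python
from urllib.parse import urlsplit

http_url = "http://www.screamingfrog.co.uk/"
https_url = "https://www.screamingfrog.co.uk/"

# Host and path are identical...
print(urlsplit(http_url).netloc == urlsplit(https_url).netloc)  # True
print(urlsplit(http_url).path == urlsplit(https_url).path)      # True

# ...but the scheme differs, so these are two distinct URIs (and two pages to a crawler).
print(urlsplit(http_url).scheme, urlsplit(https_url).scheme)    # http https
print(http_url == https_url)                                    # False
```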

This is often not a major issue, but we know duplicate content can be a problem as it causes dilution of link equity (splitting of PageRank between pages rather than combining it into one target) as well as wasting crawl allowance.

So How Do Secure Pages Get In The Index?

Well, like any URI, they are found via either internal or external links. So either you are linking to the secure page from your own website, or someone else externally is linking to the page (or to another internal page connected to it!), which is why it has been crawled and indexed. You can find secure pages in Google’s index by combining the site: and inurl:https search commands (we have zero results for our own site, wahey!).

However, one of the most common things we find is the use of a single secure page, such as a login or shopping cart / basket, which then contains relative URLs. For example –

“/this-is-a-relative-url/”

As shown above, relative URLs don’t contain any protocol information (whether they are http or https!). They simply use the same protocol as the parent page (unless stipulated in another way, such as via a base tag). Hence, when crawled from a secure page, the URL will also be secure (https). Entire websites can often then be crawled in secure format from a single switch like this!
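To illustrate, here is a small Python sketch using the standard library’s URL resolution, which follows the same rules a crawler does (www.example.com is just a placeholder domain):

```python
from urllib.parse import urljoin

relative_link = "/this-is-a-relative-url/"

# The same relative link resolved against an http parent page and an https parent page.
print(urljoin("http://www.example.com/category/", relative_link))
# -> http://www.example.com/this-is-a-relative-url/

print(urljoin("https://www.example.com/basket/", relative_link))
# -> https://www.example.com/this-is-a-relative-url/
```

The relative link itself never changes; it simply inherits whichever protocol the page it sits on was requested over.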

So, What Steps Should You Take To Ensure Your Secure Pages Are Not Indexed?

1) First of all, make sure you use the correct protocol on the correct pages. Only pages that genuinely need to be secure, like shopping basket, login or checkout pages, should be served over https. Product pages on the whole shouldn’t be, so make sure users can’t browse (and potentially link to) secure versions of these pages.

2) Use absolute URLs – absolute URLs define the protocol explicitly and don’t leave it to chance. So if you have a secure page that can be crawled (via internal or external links), make sure the links on it are absolute.

3) You could also block a shopping basket or login page via robots.txt so the search engines don’t crawl it. Be careful not to block any other secure pages that you DO want in the index, or any secure pages which might have already accrued some link equity (see point 5!) – there’s a quick sanity check sketch below. You can also consider using a ‘nofollow’ link attribute on links to the login / shopping basket page. This is the only place we might recommend using a nofollow on internal links. Matt Cutts from Google commented on this previously in a Google Webmaster Help video. Please note, you shouldn’t have to take this step if you follow the other steps in this guide. Ideally, if you don’t want your shopping or login page in the index, use a meta noindex tag.
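If you do use robots.txt, it’s worth sanity checking that only the intended pages are blocked. Here’s a rough sketch using Python’s built-in robots.txt parser (the /basket/, /login/ and product paths, and the example.com domain, are just placeholders):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks only the basket and login pages.
robots_txt = """User-agent: *
Disallow: /basket/
Disallow: /login/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Pages we intend to block from crawling.
print(parser.can_fetch("*", "https://www.example.com/basket/"))  # False
print(parser.can_fetch("*", "https://www.example.com/login/"))   # False

# A page we DO want crawled and indexed - make sure it isn't caught by the rules.
print(parser.can_fetch("*", "https://www.example.com/products/red-widget/"))  # True
```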

What Should I Do If I Already Have Duplicate Secure Pages In The Index?

4) Find the reason why you have secure pages in the index, either internal or external links, and follow the steps already outlined above. If you can’t find the link source internally, try the SEO Spider (shameless plug), which will do it for you. If it’s not an internal link, then there could be external links in play. There’s a very rough crawl sketch of the idea after this list.

5) 301 permanently redirect the secure (https) page to the correct http version. This will mean the search engines drop the https version out of the index, rank the correct http version and pass any link equity (or PageRank!) to the correct version of the page. If you can’t use a 301 redirect, then try using the canonical link element instead. Obviously, make sure you haven’t blocked any of the pages you are going to redirect in robots.txt! A quick redirect-checking sketch also follows below.
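For the curious, here’s a very rough sketch of what “finding the internal link source” involves, in plain Python with no politeness, error handling or JavaScript rendering (the start URL is a placeholder, and the SEO Spider does all of this properly for you):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_https_link_sources(start_url, max_pages=50):
    """Crawl internal pages and report which pages link to https URLs."""
    domain = urlsplit(start_url).netloc
    to_crawl, seen = [start_url], {start_url}
    while to_crawl and len(seen) <= max_pages:
        page = to_crawl.pop(0)
        try:
            html = urlopen(page, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(page, href)  # relative links inherit the page's protocol here
            if urlsplit(absolute).netloc != domain:
                continue  # only report and follow internal links
            if urlsplit(absolute).scheme == "https":
                print(f"{page} links to secure URL: {absolute}")
            if absolute not in seen:
                seen.add(absolute)
                to_crawl.append(absolute)


if __name__ == "__main__":
    find_https_link_sources("http://www.example.com/")  # placeholder start URL
```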
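And once the redirects are in place, it’s worth checking that the https version really does return a 301 pointing at its http equivalent. A minimal sketch of such a check (first hop only, HEAD request, placeholder URL):

```python
import http.client
from urllib.parse import urlsplit


def check_https_redirect(https_url):
    """Report whether an https URL 301s to its http equivalent (first hop only)."""
    parts = urlsplit(https_url)
    connection = http.client.HTTPSConnection(parts.netloc, timeout=10)
    connection.request("HEAD", parts.path or "/")
    response = connection.getresponse()
    location = response.getheader("Location", "")
    expected = "http://" + parts.netloc + (parts.path or "/")
    if response.status == 301 and location == expected:
        print(f"OK: {https_url} 301s to {location}")
    else:
        print(f"Check: {https_url} returned {response.status}, Location: {location or 'none'}")
    connection.close()


if __name__ == "__main__":
    check_https_redirect("https://www.example.com/some-product-page/")  # placeholder URL
```

Some sites redirect to a trailing-slash or otherwise canonicalised version of the URL, so treat a mismatch here as a prompt to look rather than proof of a problem.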

Hopefully this article provides a useful guide to help remove any duplicate secure (https) pages from the index.


Dan Sharp is founder & Director of Screaming Frog. He has developed search strategies for a variety of clients, from international brands to small and medium sized businesses, and designed and managed the build of the innovative SEO Spider software.