How To Find Orphan Pages
How To Find Orphan Pages Using The SEO Spider
An orphan page is a page that cannot be found by crawling the internal links of a website from the start page. Users may have trouble getting to these pages and this makes it more difficult for search engines to discover them.
Orphan pages can occur for a variety of reasons, such as old pages being unlinked but left published, issues with site architecture, products going out of stock but still existing, or the CMS creating additional unknown URLs as part of its page templates.
In the SEO Spider we classify any URL where there is no observed linking path from the start point of a crawl (generally the homepage) as an orphan page. Orphan pages can have internal links from other orphan pages.
To discover orphan pages, additional URL sources are required. These come from XML Sitemaps and from integration with the Google Analytics and Search Console APIs.
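The definition above can be illustrated as a reachability check. The sketch below is not how the SEO Spider is implemented; it assumes a tiny, invented link graph and a hypothetical `find_orphans` helper, and simply walks internal links from the start page, then reports any URL known from another source that was never reached.

```python
from collections import deque


def find_orphans(start, links, known_urls):
    """Return known URLs that have no linking path from `start`.

    `links` maps each page to the pages it links to internally.
    `known_urls` stands in for URLs learned from other sources
    (XML Sitemaps, Google Analytics, Search Console).
    """
    reachable = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in reachable:
                reachable.add(target)
                queue.append(target)
    return sorted(set(known_urls) - reachable)


# Invented example data, not taken from a real crawl.
links = {
    "/": ["/about", "/blog"],
    "/blog": ["/blog/post-1"],
    "/old-offer": ["/old-offer/terms"],  # an orphan linking to another orphan
}
known_urls = ["/", "/about", "/blog/post-1", "/old-offer", "/old-offer/terms"]

orphans = find_orphans("/", links, known_urls)
print(orphans)  # ['/old-offer', '/old-offer/terms']
```

Note that `/old-offer/terms` has an internal link pointing at it, but it is still an orphan because that link comes from a page which is itself unreachable from the start page.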
Why Are They Important?
Finding orphan pages is useful because it can help identify areas of a site or important pages that have no internal links. This can be an issue for users, and for the discovery and indexing of the pages by search engines.
Orphan pages might still be indexed due to being linked to historically or from other sources (like XML Sitemaps, or external links for example), but without any internal links, they won’t be passed internal PageRank, which will impact their scoring and organic performance in the search engines.
A small number of orphan pages is common and not generally a big issue. At scale, however, they can contribute to index bloat and crawl budget waste, result in competing pages, or simply create a poor experience if users discover outdated pages organically.
This tutorial walks you through how to use the Screaming Frog SEO Spider to find orphan pages from three sources: XML Sitemaps, Google Analytics and Search Console. To crawl the whole website and open up the configuration to integrate with the three sources, an SEO Spider licence is required.
When you’re set, simply follow the steps outlined in the tutorial below.
1) Select ‘Crawl Linked XML Sitemaps’ under ‘Configuration > Spider > Crawl’
To crawl URLs in the XML Sitemap, you can choose to auto-discover the sitemap via robots.txt (this requires a ‘Sitemap: https://www.example.com/sitemap.xml’ entry), or supply the destination of the XML Sitemap.
This means any new orphan URLs only discoverable via the XML Sitemap will be crawled.
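To check in advance whether auto-discovery has anything to find, you can look for ‘Sitemap:’ entries in robots.txt yourself. The sketch below uses Python's standard `urllib.robotparser` (the `site_maps()` method needs Python 3.8 or later). The robots.txt content is an invented example, parsed from a string so that nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content for illustration only.
robots_txt = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns None when no 'Sitemap:' entry is present.
sitemaps = parser.site_maps() or []
print(sitemaps)  # ['https://www.example.com/sitemap.xml']
```

If the list comes back empty, auto-discovery has no entry to follow and the XML Sitemap location would need to be supplied manually.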
2) Connect to Google Analytics under ‘Configuration > API Access’
You can connect to the Google Analytics API and pull in data for a specific account, property, view and segment directly during a crawl. To find orphan pages from organic search, remember to choose the ‘Organic Traffic’ segment.
You can set the date range to be analysed, which would ideally be at least a month. The metrics and dimensions can be left as default. The segment can be tweaked to ‘All Users’ or ‘Paid Traffic’ if you’re interested in finding orphan pages via other sources as well.
If you haven’t connected to GA before, have a read of our Google Analytics integration guide.
3) Select ‘Crawl New URLs Discovered In Google Analytics’
This configuration option can be found under the ‘General’ tab of the Google Analytics configuration window (Configuration > API Access > Google Analytics).
If this option isn’t enabled, then new URLs discovered via Google Analytics will only be available to view in the ‘Orphan Pages’ report. They won’t be added to the crawl queue, be viewable within the user interface, or appear under the respective tabs and filters.
4) Connect to Google Search Console under ‘Configuration > API Access’
You can connect to the Search Analytics API and pull in data such as impressions, clicks, CTR and position metrics directly during a crawl. To find orphan pages that are receiving impressions under search but are not linked to internally, simply choose the correct property.
You can set the date range for the data to be analysed which, as with Google Analytics, would ideally be at least a month.
If you haven’t connected to GSC before, have a read of our Google Search Console integration guide.
5) Select ‘Crawl New URLs Discovered In Google Search Console’
This configuration option can be found under the ‘Search Analytics’ tab of the Google Search Console configuration window (Configuration > API Access > Google Search Console).
As with Google Analytics, if this option isn’t enabled, then new URLs discovered via Google Search Console will only be available to view in the ‘Orphan Pages’ report. They won’t be added to the crawl queue, be viewable within the user interface, or appear under the respective tabs and filters.
6) Crawl The Website
Open up the SEO Spider, type or copy in the website to crawl in the ‘Enter URL to spider’ box and hit ‘Start’.
You can monitor the progress of the APIs and crawl via the progress bars and API tab.
The website, and new URLs discovered via the XML Sitemap, Google Analytics and Search Console will subsequently be crawled. Wait until the crawl finishes and reaches 100%.
7) Click ‘Crawl Analysis > Start’ To Populate Orphan URLs Filters
The majority of filters in the SEO Spider are available to view in real-time during a crawl. However, there are three respective ‘Orphan URLs’ filters under the ‘Sitemaps’, ‘Analytics’ and ‘Search Console’ tabs which can only be viewed at the end of a crawl.
They require post-crawl ‘Crawl Analysis’ to be populated with data (more on this in just a moment). The right-hand ‘overview’ pane displays a ‘(Crawl Analysis Required)’ message against filters that require post-crawl analysis to be populated with data. For example, there are five filters under ‘Sitemaps’ where it’s required.
The SEO Spider will only know which URLs are missing from an XML Sitemap and vice versa when the entire crawl completes. To populate these three orphan URLs filters you simply need to click a button.
However, if you have configured ‘Crawl Analysis’ previously, you may wish to double check under ‘Crawl Analysis > Configure’ that ‘Sitemaps’, ‘Analytics’ and ‘Search Console’ are ticked. You can also untick other items that require post-crawl analysis to make this step quicker.
When crawl analysis has completed, the ‘analysis’ progress bar will be at 100% and the filters will no longer have the ‘(Crawl Analysis Required)’ message.
They will also be populated with orphan URLs data!
8) Analyse ‘Orphan URLs’ Filters Under Sitemaps, Analytics & Search Console Tabs
You’re now able to browse each tab and respective ‘Orphan URLs’ filter to view the orphan pages found. For example, on the Screaming Frog website, there are some orphan URLs from the XML Sitemap that error or redirect.
While these aren’t pages that exist, they are orphan URLs that are not linked to internally on the website. In this example, they are old URLs that should have been removed from the XML Sitemap.
From Search Console data, there are some pages that exist on the website and respond with a 200 status code, but are not linked to internally. One of these is a guide which should really be linked to internally, while another is an old job vacancy that was removed from our careers page, but is still live and receiving organic impressions and clicks.
In the same way as the example above, the ‘Analytics’ tab and ‘Orphan URLs’ filter can be viewed as well. The data from each of these tabs and filters can be exported via the ‘Export’ button on the interface.
9) Export Combined Orphan URLs via ‘Reports > Orphan Pages’
Finally, use the ‘Orphan Pages’ report if you wish to export a combined list of all orphan pages discovered.
There’s a ‘Source’ column next to each orphan URL, which provides the source of discovery. These have been abbreviated to ‘GA’ for Google Analytics, ‘GSC’ for Google Search Console and ‘Sitemaps’, for, erm, XML Sitemaps.
If you have integrated Google Analytics and Search Console in a crawl, but didn’t tick the ‘Crawl New URLs Discovered In GA/GSC’ configuration options, then this report will still contain data for those URLs. They just won’t have been crawled, and won’t appear under the respective tabs and filters.
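Once exported, the combined report can be summarised outside the tool, for example to see which orphan URLs were found by more than one source. The sketch below is a rough illustration only. The column headers ‘Orphan URL’ and ‘Source’ are assumptions (check them against your own export), the rows are invented sample data, and `sources_by_url` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; the header names are assumed, not confirmed.
sample_report = """\
Orphan URL,Source
https://www.example.com/old-guide/,GSC
https://www.example.com/old-guide/,GA
https://www.example.com/retired-page/,Sitemaps
"""


def sources_by_url(csv_file, url_col="Orphan URL", source_col="Source"):
    """Group the discovery sources for each orphan URL."""
    found = defaultdict(set)
    for row in csv.DictReader(csv_file):
        found[row[url_col]].add(row[source_col])
    return {url: sorted(sources) for url, sources in found.items()}


result = sources_by_url(io.StringIO(sample_report))
for url, sources in result.items():
    print(url, sources)
```

To run this against a real export, replace the `io.StringIO(...)` argument with a file opened via `open(path, newline="", encoding="utf-8")`, and adjust `url_col` and `source_col` if your headers differ.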
Final Tip! Identify Orphan Pages In The Internal Tab Via Blank Crawl Depth
The ‘Internal’ tab includes every URL found in a crawl, including orphan URLs. To identify which URLs are orphan pages, filter for a blank ‘crawl depth’.
URLs that have not been discovered naturally via internal links during a crawl will not have a ‘crawl depth’.
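The same blank crawl depth check can be applied to an export of the ‘Internal’ tab. The sketch below assumes the export contains ‘Address’ and ‘Crawl Depth’ columns (verify the header names in your own file), uses invented rows, and relies on a hypothetical `blank_depth_urls` helper.

```python
import csv
import io

# Invented sample rows; the header names are assumed, not confirmed.
sample_internal = """\
Address,Status Code,Crawl Depth
https://www.example.com/,200,0
https://www.example.com/about/,200,1
https://www.example.com/old-vacancy/,200,
"""


def blank_depth_urls(csv_file, url_col="Address", depth_col="Crawl Depth"):
    """Return URLs whose crawl depth cell is empty."""
    return [
        row[url_col]
        for row in csv.DictReader(csv_file)
        if not (row.get(depth_col) or "").strip()
    ]


blank_depth = blank_depth_urls(io.StringIO(sample_internal))
print(blank_depth)  # ['https://www.example.com/old-vacancy/']
```

A crawl depth of 0 (the start page) is not treated as blank here, because the check is on the raw text of the cell rather than on its numeric value.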
The guide above should help illustrate the simple steps required to find orphan pages using the SEO Spider.
If you have any further queries on the process outlined above, then just get in touch via support.