How To Debug Missing Pages In A Crawl
Two of our most frequently asked questions are ‘Why is the SEO Spider not finding all my pages?’ and ‘Why is this URL or section not being crawled?’.
Or sometimes just ‘Why is this tool not working properly!?’.
Learning how to debug missing pages is an important skill for an SEO, as it can help identify potential issues with crawling and indexing, or at the very least provide you with a better technical understanding of the site set-up when working to improve organic visibility.
If your site is not crawling past the first page, then check out our tutorial on ‘Why Won’t My Website Crawl?‘.
This tutorial walks you through how to identify the cause of missing pages in a crawl when using the Screaming Frog SEO Spider, including where to start, and the most common issues encountered.
Why Might Pages Be Missing from a Crawl?
It’s important to remember the fundamentals of how a crawler like the Screaming Frog SEO Spider works to understand why it might not be able to discover pages.
The SEO Spider discovers links by scanning the HTML code for <a> tags with an href attribute from the start page of a crawl. It will then crawl these links to discover more links.
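To make the discovery step concrete, here is a minimal sketch in Python of collecting links from <a> tags with an href attribute. It only illustrates the principle and is not the SEO Spider's own code; the names AnchorCollector and extract_links are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects href values from <a> tags and ignores everything else."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only <a> elements count as links for discovery purposes.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative URLs against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Hypothetical helper: return absolute URLs of all <a href> links."""
    parser = AnchorCollector(base_url)
    parser.feed(html)
    return parser.links
```

A span carrying an href, or an <a> tag with only an onclick handler, is skipped by this sketch, which mirrors why such markup does not lead to new pages being discovered.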
The crawl is breadth first, meaning it will crawl URLs at the present depth, before moving onto crawling URLs at the next depth level. This means it will crawl from the start page first, then crawl the URLs it links to, before moving onwards to the URLs they link to, and so on – until it completes the crawl.
In the stereotypical site architecture, this means the homepage, then categories, subcategories and products will be crawled. The order of crawl isn’t always that important, but the crawl path is – to be able to find a page, there must be a crawlable path to it from internal links from the starting point of a crawl.
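The breadth-first behaviour and the need for a crawlable path can be sketched as follows. This is a simplified model under stated assumptions: the site is a hypothetical in-memory dictionary mapping each URL to the URLs it links to, and breadth_first_crawl is an illustrative name rather than anything from the SEO Spider.

```python
from collections import deque


def breadth_first_crawl(start, link_graph):
    """Return {url: crawl depth} for every URL reachable from start."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        # Every URL at the current depth is queued before deeper ones.
        for target in link_graph.get(url, []):
            if target not in depths:
                depths[target] = depths[url] + 1
                queue.append(target)
    return depths


# Hypothetical site: /orphan/ links out, but nothing links to it.
site = {
    "/": ["/category/"],
    "/category/": ["/category/sub/"],
    "/category/sub/": ["/product-1/"],
    "/orphan/": ["/"],
}
depths = breadth_first_crawl("/", site)
```

In this model /orphan/ never appears in the result, because no page reachable from the start page links to it.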
While it can sometimes be assumed there is an issue with the crawler when a page can’t be discovered, usually there are two main reasons for it.
- 1) They are not linked to in a way that can be discovered (via a crawlable link path).
- 2) The SEO Spider configuration is not set-up to find them.
It’s important to remember the SEO Spider doesn’t crawl the whole web like a search engine, so it will only find pages linked to internally. There can be some disparity between what is in Google’s index and what the SEO Spider crawls due to the differences in the way they crawl.
The disparity doesn’t make crawl data less useful though. It can help uncover gaps that might be assumed not to be an issue from just Google index data.
Why Is This Important?
Users need links to navigate a website and find content they want. Search engines use links to discover pages and index them to be able to show them in their search results.
If a website doesn’t link to a page in a way that Google can follow, then it might not be indexed.
Search engines treat links as votes when ranking pages, as in Google’s PageRank algorithm. If a website doesn’t link to a page then it might not rank as well in the search results.
Google also use links to better understand the relationship between pages, and their relevance through anchor text and other contextual signals.
Where Do You Start to Debug Missing Pages?
To find a page there must be a crawlable path to it from the starting point of a crawl for the SEO Spider to follow. So the best place to start the debugging process is to answer:
- 1) What page(s) are not being crawled?
- 2) What page(s) on the website link to them?
Analyse the URLs found in the Internal tab of the SEO Spider to see what pages are missing from a crawl. You can sort URLs alphabetically and then scroll through to scan them by URL pattern for missing sections or pages.
Or use the right hand ‘Site Structure’ tab for an overview of the number of URLs discovered by section. If a section is missing from the structure, or has fewer URLs than expected, then that’s where to analyse further.
If you’re able to answer which pages link to them, the next question is – Are they being found in the crawl? Again, you can quickly search in the Internal tab.
This helps you work backwards to the source of the issue.
For example, if product pages are not being found and they are linked to from category pages, and those category pages are being crawled – then you know that’s the source of the issue to investigate further.
If you reach this step and can’t answer ‘what page(s) on the website link to them?’, then that is likely the reason they haven’t been crawled.
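The working-backwards logic above can be written down as a small decision helper. This is only a sketch of the reasoning: trace_missing_page is a hypothetical function, and its inputs (the pages you believe link to the missing URL, and the set of URLs seen in the Internal tab) are assumed to be gathered by hand or from an export.

```python
def trace_missing_page(linking_pages, crawled_urls):
    """Suggest the next debugging step for a missing URL."""
    if not linking_pages:
        # Nothing is known to link to the page, the likely root cause.
        return "no known linking pages"
    crawled = [url for url in linking_pages if url in crawled_urls]
    if crawled:
        # The source pages were crawled, so their links need inspecting.
        return "inspect links on: " + ", ".join(sorted(crawled))
    # The source pages are missing too, so repeat the process for them.
    return "linking pages were not crawled; trace those next"
```

For example, calling it with ["/category/"] and a crawled set that contains "/category/" points you at the category page's links.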
How Do You Debug Missing Pages?
When you have identified the source page which should link to the missing page(s) or its crawl path, you’ll need to analyse it to see how it links to the missing pages.
The SEO Spider will only follow links that use <a> tags with resolvable URLs in the HTML.
<a href="https://example.com">Example</a>
This is the same as Google, who provide documentation on crawlable links that they can and cannot follow.
Remember not to interact with the page before any analysis, as the SEO Spider and search bots do not click around like a user and load additional events. Open up the page in Chrome, right click and ‘view page source’.
If the missing page is not linked in the raw or rendered HTML, or is not within an <a> tag with an href attribute, then you have identified an SEO issue to resolve with the website’s developers.
Common Reasons for Missing Pages
We are in the fortunate(?) position to see lots of different technical issues, which can impact the crawling and indexing of pages. Typically the most common reasons for not finding all pages in a crawl are the following.
No Internal Links
It may surprise you, but this is typically the most common reason for not crawling a page. It’s simply not linked to internally on the website.
If it’s not linked to, then the SEO Spider will not discover it by default. This can obviously be an issue for users, and discovery and indexing of the pages by search engines.
URLs that are not internally linked to can be known as orphan pages, which can be found by integrating alternative methods of discovery, such as XML Sitemaps, Google Analytics and Search Console APIs into a crawl.
Please see our tutorial on how to find orphan pages.
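Conceptually, finding orphan candidates is a comparison between the URLs found by following links and the URLs known from other sources. The sketch below assumes you already have those lists (for example from exports); find_orphan_candidates and the source names are hypothetical, and it does not call any SEO Spider or Google API.

```python
def find_orphan_candidates(crawled_urls, other_sources):
    """Map each URL missing from the crawl to the sources that list it."""
    crawled = set(crawled_urls)
    orphans = {}
    for source_name, urls in other_sources.items():
        for url in urls:
            if url not in crawled:
                # Known elsewhere, but not reached via internal links.
                orphans.setdefault(url, []).append(source_name)
    return orphans


crawled = ["https://example.com/", "https://example.com/a/"]
sources = {
    "sitemap": ["https://example.com/a/", "https://example.com/b/"],
    "analytics": ["https://example.com/b/", "https://example.com/c/"],
}
orphan_candidates = find_orphan_candidates(crawled, sources)
```

Here /b/ and /c/ would be flagged, as they appear in the other sources but not in the crawl.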
You’ll also find that the page has no hyperlinks in the lower ‘outlinks’ tab, as they are not being rendered and therefore can’t be seen.
If you’re unable to find all the news articles, blog posts or products on a website, are paginated pages being crawled? If not, why not? It’s time to analyse their set up.
Check the main blog or news page and click the pagination in your browser to see if they have unique URLs.
Getting pagination wrong is not limited to just blog and news articles. We also see this regularly for products, where buttons are used to load more products onto the same page.
In this case they don’t use a crawlable link type and if the URL is accessed by itself, it’s a search result page that only loads the main body content.
The basic principles for pagination are –
- Ensure paginated pages have a unique URL such as /blog/2/ and avoid using fragments.
- Link to paginated pages using <a> tags with an href attribute.
- To help Google understand the relationship between pages, include links from each page to the following page.
Paginated pages should contain unique paginated content on them for users, which can then be crawled and indexed by search engines.
Google have an excellent guide on pagination best practices for ecommerce.
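One quick check for the first principle is whether pagination links resolve to a genuinely different URL once any fragment is removed. The following is a rough sketch using Python's standard urllib.parse; audit_pagination_links is a hypothetical helper and the URLs are placeholders.

```python
from urllib.parse import urldefrag, urljoin


def audit_pagination_links(page_url, hrefs):
    """Label each pagination href as a unique URL or fragment-only."""
    base, _ = urldefrag(page_url)
    report = {}
    for href in hrefs:
        # Resolve the href, then drop the #fragment part.
        resolved, _ = urldefrag(urljoin(page_url, href))
        if resolved == base:
            report[href] = "same URL once the fragment is removed"
        else:
            report[href] = "unique URL"
    return report


report = audit_pagination_links(
    "https://example.com/blog/", ["/blog/2/", "#page=2"]
)
```

A link such as #page=2 points back to the same URL as the current page, so it offers nothing new to crawl, whereas /blog/2/ is a separate page.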
Uncrawlable Link Types
Google and the SEO Spider will only follow links that use proper <a> tags with resolvable URLs in the HTML.
<a href="https://example.com">Example</a>
So analyse the raw and rendered HTML and verify that links have been coded correctly. We see lots of variety and mistakes, such as div or span tags with an href.
Google and the SEO Spider will not follow other formats such as the following –
- <a routerLink="some/path">
- <span href="https://example.com">
- <a onclick="goto('https://example.com')">
- <span data-link="https://example.com">
Please see Google’s own guidelines on making links crawlable for their search engine.
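If you want to scan a page's HTML for these patterns, a rough audit along the following lines can help. It is a sketch under stated assumptions: the attribute and tag lists are illustrative rather than exhaustive, audit_links is a hypothetical name, and Python's HTMLParser lowercases attribute names, so routerLink is reported as routerlink.

```python
from html.parser import HTMLParser

# Illustrative lists only; real sites may use other attributes or tags.
LINK_LIKE_ATTRS = {"href", "routerlink", "data-link", "onclick"}
AUDITED_TAGS = {"a", "span", "div", "button"}


class LinkAuditor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.crawlable = []
        self.suspect = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            # A proper link: an <a> tag with an href value.
            self.crawlable.append(attr_map["href"])
        elif tag in AUDITED_TAGS:
            found = sorted(LINK_LIKE_ATTRS & attr_map.keys())
            if found:
                # Looks like navigation, but not in a followable form.
                self.suspect.append((tag, found))


def audit_links(html):
    """Return (crawlable hrefs, suspect (tag, attributes) pairs)."""
    auditor = LinkAuditor()
    auditor.feed(html)
    return auditor.crawlable, auditor.suspect
```

Anything returned in the second list is worth raising with developers, as it is the kind of markup listed above that will not be followed.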
The SEO Spider does not interact with page content like a user and will not use search forms to run searches and find content. If a website has content that is only accessible by running a search, then it will not be found.
There must be crawlable link paths to any content that needs to be indexed. Ensure any essential content is internally linked through alternative user journeys.
For example, if properties on a website are only accessible through search, they could be linked to internally by using location pages linked from additional menus.
The SEO Spider configuration can be a big part in what is being crawled.
It can often be useful to clear your config at the outset of a crawl via ‘File > Config > Clear Default Config’, to rule out any previous adjustments impacting what it’s able to discover during a crawl.
If the above does not help, then please also review the following:
- Links or linking pages do not have ‘nofollow’ attributes or directives preventing the SEO Spider from following them. By default the SEO Spider obeys ‘nofollow’ directives unless the ‘follow internal nofollow‘ configuration is checked.
- The expected page(s) are on the same subdomain as your starting page. By default links to different subdomains are treated as external unless the Crawl all subdomains option is checked.
- If the expected page(s) are in a different subfolder to the starting point of the crawl, the Crawl outside start folder option is checked.
- You do not have an Include or Exclude function set up that is limiting the crawl.
- Ensure category pages (or similar) were not temporarily unreachable during the crawl, giving a connection timeout, server error etc. preventing linked pages from being discovered.
- By default the SEO Spider won’t crawl the XML Sitemap of a website to discover new URLs. However, you can select to ‘Crawl Linked XML Sitemaps‘ in the configuration.
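The subdomain and subfolder checks in the list above can be reasoned about with a small model. This is an approximation of the described defaults and not the SEO Spider's actual logic: is_in_scope, its parameter names and the explicit site_domain argument are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def is_in_scope(candidate_url, start_url, site_domain,
                crawl_all_subdomains=False,
                crawl_outside_start_folder=False):
    """Roughly model whether a URL would be treated as crawlable."""
    cand, start = urlsplit(candidate_url), urlsplit(start_url)
    host = cand.hostname or ""
    if host != start.hostname:
        # A different host only counts when it belongs to the same
        # site and the subdomain option is switched on.
        is_subdomain = host == site_domain or host.endswith("." + site_domain)
        if not (crawl_all_subdomains and is_subdomain):
            return False
    if not crawl_outside_start_folder:
        # Folder of the start URL, e.g. /blog/ for /blog/index.html.
        start_folder = start.path.rsplit("/", 1)[0] + "/"
        if not (cand.path or "/").startswith(start_folder):
            return False
    return True
```

For instance, with a crawl started at https://www.example.com/blog/, a URL under /shop/ is out of scope in this model until crawl_outside_start_folder is set to True.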
Learning how to debug why pages are missing from a crawl will help you have a better understanding about potential problems search engines might have when crawling, indexing and ranking a website in organic search.
Understanding how a crawler follows links to discover other pages will help you make better decisions around site architecture and internal linking.
If you have any queries on missing pages, then just get in touch with our team via support with details on the issue, including the missing page, and source page that links to it.