How To Debug Missing Pages In A Crawl

One of our most frequently asked questions is ‘Why is the SEO Spider not finding all my pages?’ and ‘Why is this URL, or section not being crawled?’.

Or sometimes just ‘Why is this tool not working properly!?’.

Learning how to debug missing pages is an important skillset as an SEO, as it can help identify potential issues with crawling and indexing, or at the very least – provide you with a better technical understanding of the site set-up when working to improve organic visibility.

If your site is not crawling past the first page, then check out our tutorial on ‘Why Won’t My Website Crawl?‘.

This tutorial walks you through how to identify the cause of missing pages in a crawl when using the Screaming Frog SEO Spider, including where to start, and the most common issues encountered.


Why Might Pages Be Missing from a Crawl?

It’s important to remember the fundamentals of how a crawler like the Screaming Frog SEO Spider works to understand why it might not be able to discover pages.

The SEO Spider discovers links by scanning the HTML code for <a> tags with an href attribute from the start page of a crawl. It will then crawl these links to discover more links.

The crawl is breadth first, meaning it will crawl URLs at the present depth, before moving onto crawling URLs at the next depth level. This means it will crawl from the start page first, then crawl the URLs it links to, before moving onwards to the URLs they link to, and so on – until it completes the crawl.

Site Architecture Diagram

In the stereotypical site architecture, this means the homepage, then categories, subcategories and products will be crawled. The order of crawl isn’t always that important, but the crawl path is – to be able to find a page, there must be a crawlable path to it from internal links from the starting point of a crawl.
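
As a minimal, hypothetical sketch of such a crawl path, the homepage links to a category page, which in turn links to its products using <a> tags with an href attribute:

<!-- Hypothetical homepage (crawl start point) - these links are discovered first -->
<nav>
  <a href="/category/shoes/">Shoes</a>
  <a href="/category/bags/">Bags</a>
</nav>

<!-- Hypothetical category page (/category/shoes/) - crawled at the next depth, exposing products -->
<ul>
  <li><a href="/category/shoes/example-trail-shoe/">Example Trail Shoe</a></li>
  <li><a href="/category/shoes/example-road-shoe/">Example Road Shoe</a></li>
</ul>

Each product URL is only discoverable because an <a> tag with an href attribute links to it somewhere along that path from the start page.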

While it can sometimes be assumed there is an issue with the crawler when a page can’t be discovered, usually there are two main reasons for it.

  • 1) They are not linked to in a way that can be discovered (via a crawlable link path).
  • 2) The SEO Spider configuration is not set-up to find them.

It’s important to remember the SEO Spider doesn’t crawl the whole web like a search engine, so it will only find pages linked to internally. There can be some disparity between what is in Google’s index and what the SEO Spider crawls due to the differences in the way they crawl.

The disparity doesn’t make crawl data less useful though; it can help uncover gaps that might otherwise be assumed not to be an issue when looking at Google index data alone.


Why Is This Important?

Users need links to navigate a website and find content they want. Search engines use links to discover pages and index them to be able to show them in their search results.

If a website doesn’t link to a page in a way that Google can follow, then it might not be indexed.

Search engines use links as a vote in ranking, most famously in Google’s PageRank algorithm. If a website doesn’t link to a page, then it might not rank as well in the search results.

Google also use links to better understand the relationship between pages, and their relevance through anchor text and other contextual signals.


Where Do You Start to Debug Missing Pages?

To find a page there must be a crawlable path to it from the starting point of a crawl for the SEO Spider to follow. So the best place to start the debugging process is to answer:

  • 1) What page(s) are not being crawled?
  • 2) What page(s) on the website link to them?

Analyse the URLs found in the Internal tab of the SEO Spider to see what pages are missing from a crawl. You can sort URLs alphabetically and then scroll through to scan them by URL pattern for missing sections or pages.

Sorting & scrolling Internal tab URLs

Or use the right-hand ‘Site Structure’ tab for an overview of the number of URLs discovered by section. If a section is missing from the structure, or has fewer URLs than expected, then that’s where to analyse further.

Missing pages from Site Structure

If you’re able to answer which pages link to them, the next question is – Are they being found in the crawl? Again, you can quickly search in the Internal tab.

This helps you work backwards to the source of the issue.

For example, if product pages are not being found and they are linked to from category pages, and those category pages are being crawled – then you know that’s the source of the issue to investigate further.


If you reach this step and can’t answer ‘what page(s) on the website link to them?’, then that is likely the reason they haven’t been crawled.


How Do You Debug Missing Pages?

When you have identified the source page which should link to the missing page(s), or its crawl path, you’ll need to analyse it to see how it links to the missing pages.

The SEO Spider will only follow links that use <a> tags with resolvable URLs in the HTML.

<a href="https://example.com">Anchor Text</a>

This is the same as Google, who provide documentation on crawlable links that they can and cannot follow.

Remember not to interact with the page before any analysis, as the SEO Spider and search bots do not click around like a user and trigger additional events. Open up the page in Chrome, right click and ‘view page source’.

Right Click View Source in Chrome

Then run a search (Ctrl + F) for the missing URL. This is the raw HTML, before JavaScript has run. Ideally, the URL would be there in a crawlable link.

Search HTML for Link

If it’s not there, right click and ‘inspect element’, which shows the rendered HTML after JavaScript has run, and search again.

Search For Link Rendered HTML

If the missing URL is there, then you know it relies upon JavaScript. In this case, read our tutorial on How To Crawl JavaScript Websites.

If the missing page is not linked in the raw or rendered HTML, or is not within an <a> tag with an href attribute, then this has helped identify an SEO issue to resolve with the website’s developers.


Common Reasons for Missing Pages

We are in the fortunate(?) position to see lots of different technical issues, which can impact the crawling and indexing of pages. Typically the most common reasons for not finding all pages in a crawl are the following.

Not Linked To Internally

It may surprise you, but this is typically the most common reason for a page not being crawled. It’s simply not linked to internally on the website.

If it’s not linked to, then the SEO Spider will not discover it by default. This can obviously be an issue for users, as well as for the discovery and indexing of the pages by search engines.

URLs that are not internally linked to can be known as orphan pages, which can be found by integrating alternative methods of discovery, such as XML Sitemaps, Google Analytics and Search Console APIs into a crawl.

Orphan Pages

Please see our tutorial on how to find orphan pages.

JavaScript

By default the SEO Spider will crawl the raw HTML before JavaScript has been executed.

If the site is reliant entirely on client-side JavaScript to populate content and links, then you will often find only the homepage is crawled with a 200 ‘OK’ response, with a few JavaScript and CSS files.
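
For example, the raw HTML of a fully client-side rendered homepage often contains little more than an empty root element and script references. A simplified, hypothetical sketch:

<!-- Hypothetical raw HTML of a client-side rendered homepage - no <a> links to crawl -->
<!DOCTYPE html>
<html>
  <head>
    <title>Example Store</title>
    <link rel="stylesheet" href="/static/styles.css">
  </head>
  <body>
    <div id="root"></div>
    <!-- Navigation, content and links are only built once this bundle executes -->
    <script src="/static/app.bundle.js"></script>
  </body>
</html>

With only the CSS and JavaScript files referenced in the raw HTML, those are the only URLs the SEO Spider can discover without rendering the page.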

JavaScript site not crawling

You’ll also find that the page has no hyperlinks in the lower ‘outlinks’ tab, as they are not being rendered and therefore can’t be seen.

JavaScript sites outlinks

While this is often a sign that the website is using a JavaScript framework with client-side rendering, there can be more subtle and harder to spot JavaScript dependencies, where it’s used for specific functions.

One of the most common is JavaScript used on category pages of ecommerce websites to load products. If you have identified that product pages are not being crawled, but category pages are, it’s time to investigate.

An example of this behaviour can be seen on category pages of the Garmin website. You can right click and ‘view page source’ in Chrome and search for product URLs on the page, or specific elements of text, to see they are not present before JavaScript.

No link in raw HTML

If you right click and ‘inspect element’, which is after JavaScript has been processed, you’ll see the links within an <a> tag with an href attribute.

Link present in rendered HTML
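
A simplified, hypothetical version of this pattern looks something like the following, where the product links only exist in the DOM once the script has run:

<!-- Raw HTML ('view page source') - the product grid is empty, so no links can be crawled -->
<div id="product-grid"></div>

<script>
  // Hypothetical script - the <a> links only appear in the rendered HTML ('inspect element')
  document.getElementById('product-grid').innerHTML =
    '<a href="/products/example-watch-1/">Example Watch 1</a>' +
    '<a href="/products/example-watch-2/">Example Watch 2</a>';
</script>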

To help identify JavaScript dependencies in the SEO Spider and crawl the product pages, you can enable JavaScript rendering mode via ‘Config > Spider > Rendering’.

Enable JavaScript rendering mode

The JavaScript tab and relevant ‘Contains JavaScript Links’ and ‘Contains JavaScript Content’ filters report pages that have links or content that are only in the rendered HTML after JavaScript.

JavaScript rendering mode, tab and filters

You can view the ‘Outlinks’ tab of the category page to see whether product links have now been discovered, and identify those only in the rendered HTML after JavaScript by adjusting the ‘All Link Types’ filter to ‘Hyperlinks’ and the ‘All Link Origin Types’ filter to ‘Rendered HTML’ only.

Product Links only in the rendered HTML

In JavaScript rendering mode the SEO Spider will crawl both the original and rendered HTML to identify pages that have content or links only available client-side and report other dependencies and potential issues.

Please read our ‘How To Crawl JavaScript Websites‘ tutorial.

Uncrawlable Pagination

If you’re unable to find all the news articles, blog posts or products on a website, are paginated pages being crawled? If not, why not? It’s time to analyse their set up.

Check the main blog or news page and click the pagination in your browser to see if they have unique URLs.

Pagination unique URLs

Some pagination simply uses JavaScript to refresh the page and load additional blog posts or articles, which won’t be seen by the SEO Spider or Google.

Analyse the raw and rendered HTML to see if there are any links to paginated URLs, or if they are buttons that use JavaScript without <a> tags with an href attribute.

Pagination Buttons

Getting pagination wrong is not limited to just blog and news articles; we also see this regularly for products, where buttons are used to load more products onto the same page.

Missing Products due to pagination

In this case the buttons don’t use a crawlable link type, and if the URL is accessed by itself, it’s a search results page that only loads the main body content.

The basic principles for pagination are –

  • Ensure paginated pages have a unique URL such as /blog/2/ and avoid using fragments.
  • Link to paginated pages using <a> tags with an href attribute.
  • To help Google understand the relationship between pages, include links from each page to the following page.

Paginated pages should contain unique paginated content on them for users, which can then be crawled and indexed by search engines.
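
As a hypothetical sketch of these principles, a ‘Load more’ button with no URL is uncrawlable, whereas paginated pages with unique URLs linked via <a> tags can be followed:

<!-- Uncrawlable - loads more posts with JavaScript, there is no URL for a crawler to follow -->
<button onclick="loadMorePosts()">Load more</button>

<!-- Crawlable - unique paginated URLs linked with <a> tags, including a link to the following page -->
<nav>
  <a href="/blog/">1</a>
  <a href="/blog/2/">2</a>
  <a href="/blog/3/">3</a>
  <a href="/blog/2/">Next</a>
</nav>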

Google have an excellent guide on pagination best practices for ecommerce.

Uncrawlable Link Types

Google and the SEO Spider will only follow links that use proper <a> tags with resolvable URLs in the HTML.

<a href="https://example.com">Anchor Text</a>

So analyse the raw and rendered HTML and verify that links have been coded correctly. We see lots of variety and mistakes, such as div or span tags used with an href attribute.

Google and the SEO Spider will not follow other formats such as the following –

  • <a routerLink="some/path">
  • <span href="https://example.com">
  • <a onclick="goto('https://example.com')">
  • javascript:goTo('products')
  • javascript:window.location.href='/products'
  • <span data-link="https://example.com">
  • #

Please see Google’s own guidelines on making links crawlable for their search engine.
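
Where JavaScript behaviour is still needed on a link, Google’s guidance is that it can be combined with a resolvable URL in the href attribute. A hypothetical before and after:

<!-- Not crawlable - no href with a resolvable URL -->
<a onclick="goTo('https://example.com/products/')">Products</a>

<!-- Crawlable - the href provides a real URL, and JavaScript can still enhance the click -->
<a href="https://example.com/products/" onclick="goTo('https://example.com/products/')">Products</a>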

Search Forms

The SEO Spider does not interact with page content like a user and will not use search forms to run searches and find content. If a website has content that is only accessible by running a search, then it will not be found.

Search Form

There must be crawlable link paths to any content that needs to be indexed. Ensure any essential content is internally linked through alternative user journeys.

In the example above, properties are only accessible through search, but they could be linked to internally by using location pages linked from additional menus.
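
As a minimal, hypothetical sketch, the search form below won’t be used for discovery, so crawlable <a> links to the same content are needed as an alternative path:

<!-- A search form - the SEO Spider and search bots won't submit this to find property pages -->
<form action="/search" method="get">
  <input type="text" name="location" placeholder="Search by location">
  <button type="submit">Search</button>
</form>

<!-- Hypothetical alternative crawl path - location pages linked with <a> tags -->
<ul>
  <li><a href="/properties/london/">Properties in London</a></li>
  <li><a href="/properties/manchester/">Properties in Manchester</a></li>
</ul>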

Configuration

The SEO Spider configuration can play a big part in what is being crawled.

It can often be useful to clear your config at the outset of a crawl via ‘File > Config > Clear Default Config’, to rule out any previous adjustments impacting what it’s able to discover during a crawl.

Clear Default Config

If the above does not help, then please also review the following:

  • You’re crawling in JavaScript rendering mode if the site uses JavaScript.
  • Links or linking pages do not have ‘nofollow’ attributes or directives preventing the SEO Spider from following them. By default the SEO Spider obeys ‘nofollow’ directives unless the ‘follow internal nofollow‘ configuration is checked.
  • The expected page(s) are on the same subdomain as your starting page. By default links to different subdomains are treated as external unless the Crawl all subdomains option is checked.
  • If the expected page(s) are in a different subfolder to the starting point of the crawl, the Crawl outside start folder option is checked.
  • The linking page(s) are not blocked by Robots.txt. By default the robots.txt is obeyed so any links on a blocked page will not be seen unless the Ignore robots.txt option is checked. If the site uses JavaScript and the rendering configuration is set to ‘JavaScript’, ensure JS and CSS are not blocked by robots.txt.
  • You do not have an Include or Exclude function set up that is limiting the crawl.
  • Ensure category pages (or similar) were not temporarily unreachable during the crawl, giving a connection timeout, server error etc. preventing linked pages from being discovered.
  • By default the SEO Spider won’t crawl the XML Sitemap of a website to discover new URLs. However, you can select to ‘Crawl Linked XML Sitemaps‘ in the configuration.

Closing Thoughts

Learning how to debug why pages are missing from a crawl will give you a better understanding of the potential problems search engines might have when crawling, indexing and ranking a website in organic search.

Understanding how a crawler works, following links to discover other pages, will help you make better decisions around site architecture and internal linking.

Check out our Screaming Frog SEO Spider user guide, FAQs and tutorials for more advice and tips.

If you have any queries on missing pages, then just get in touch with our team via support with details on the issue, including the missing page, and source page that links to it.
