How To Crawl JavaScript Websites

Introduction To Crawling JavaScript

Historically, search engine bots such as Googlebot didn’t crawl and index content created dynamically using JavaScript, and were only able to see what was in the static HTML source code. However, Google in particular has evolved, deprecating its old AJAX crawling scheme guidelines of escaped-fragment #! URLs and HTML snapshots in October 2015, and is now generally able to render and understand web pages like a modern-day browser.

JavaScript usage is up: adoption of Google’s own JavaScript MVW framework AngularJS, other frameworks such as React, and single-page and progressive web apps is on the rise.

When crawling and analysing onsite SEO, it’s become essential to be able to read the DOM after JavaScript has come into play and constructed the web page.

Traditional website and SEO crawlers used to scan website links and content are only able to crawl static HTML. That changed when we launched the first ever JavaScript rendering functionality in our Screaming Frog SEO Spider software, which executes JavaScript and reads the DOM.

We integrated the Chromium project library for our rendering engine to emulate Google as closely as possible.

Screaming Frog SEO Spider 7.0

This guide contains the following three sections. Click to jump to the relevant section, or continue reading.

  1. Why You Shouldn’t Crawl Blindly With JavaScript Enabled
  2. How To Identify JavaScript
  3. How To Crawl JavaScript Websites

If you already understand the basic principles of JavaScript and just want to crawl a JavaScript website, skip straight to our guide on configuring the Screaming Frog SEO Spider tool to crawl JavaScript sites. Or, read on.

Why You Shouldn’t Crawl Blindly With JavaScript Enabled

While JavaScript crawling is essential in auditing today, we recommend using it selectively when required, and only keeping it enabled by default after careful consideration.

You don’t have to identify whether the site itself is using JavaScript. You can just go ahead and crawl with JavaScript rendering enabled, and sites that use JavaScript will be crawled. However, you should take care, as there are a couple of big problems with blindly crawling with JavaScript enabled.

First of all, JavaScript crawling is slower and more intensive for the server, as all resources (whether JavaScript, CSS, images etc.) need to be fetched to render each web page. This won’t be an issue for smaller websites, but for a large website with many thousands or more pages, this can make a huge difference.

If your site doesn’t rely on JavaScript to dynamically manipulate a web page significantly, then there’s often no need to waste time and resources.

More importantly, if you’re auditing a website you should know how it’s built and not put all your faith in any tool. JavaScript frameworks can be quite different to one another (some allow server-side rendering), and the SEO implications are very different to those of a traditional HTML site.

Plenty of sites are still using the old AJAX crawling scheme as well, which requires a unique set-up, and this is very different to relying purely on rendering JavaScript for crawling, indexing and scoring.

While Google can typically crawl and index JavaScript, there are some core principles and limitations that need to be understood.

  1. All the resources of a page (JS, CSS, imagery) need to be available to be crawled, rendered and indexed.
  2. Google still requires clean, unique URLs for a page, and links to be in proper HTML anchor tags (you can offer a static link, as well as calling a JavaScript function; see the example after this list).
  3. They don’t click around like a user and trigger additional events after the render (a click, a hover or a scroll, for example).
  4. The rendered page snapshot is taken at 5 seconds, so content needs to be loaded by that time for each web page, or it simply won’t be indexed.
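
For example, offering a crawlable static URL in the href while still calling a JavaScript function might look something like the following (a simplified sketch; the URLs and the loadPage() function are hypothetical):

  <!-- Crawlable: a real URL in the href, with JavaScript layered on top -->
  <a href="/products/blue-widget" onclick="loadPage('/products/blue-widget'); return false;">Blue Widget</a>

  <!-- Harder to crawl: no static URL for a crawler to follow -->
  <a href="#" onclick="loadPage('/products/blue-widget')">Blue Widget</a>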

It’s essential you know these things with JavaScript SEO, as you live and die by the render in rankings.

It’s also important to remember that Google strongly advises against relying purely on JavaScript, and recommends developing with progressive enhancement: building the site’s structure and navigation using only HTML, and then improving the site’s appearance and interface with AJAX.

For further reading on JavaScript SEO, I highly recommend a guide on the core principles from Justin Briggs and SEO guides on progressive web apps, Angular.JS, Universal Angular 2.0 and React.JS by Builtvisible.

I also recommend having a read of Bartosz Góralewicz’s guide to crawling JavaScript, which touches upon some of the points in this guide.

The purpose of this guide isn’t to go into lots of detail about JavaScript SEO, but more specifically to show how to identify and crawl JavaScript websites effectively using our Screaming Frog SEO Spider software.

How To Identify JavaScript Sites

Identifying a site built using a JavaScript framework can be pretty simple; however, identifying sections, pages or smaller elements that are dynamically adapted using JavaScript can be far more challenging.

There are a number of ways you’ll know whether the site is built using a JavaScript framework.

Crawling

This is a starting point for many, and you can just go ahead and start a crawl of a website with the standard configuration. By default, the SEO Spider will crawl using the ‘old AJAX crawling scheme’, which means JavaScript is disabled, but the old AJAX crawling scheme will be adhered to if it has been set up correctly by the website.

If the site uses JavaScript and is set up with escaped-fragment (#!) URLs and HTML snapshots as per Google’s old AJAX crawling scheme, then it will be crawled and URLs will appear under the ‘AJAX’ tab in the SEO Spider. This tab only includes pages using the old AJAX crawling scheme specifically, not every page that uses AJAX.

JavaScript site using the old AJAX crawling scheme

The AJAX tab shows both the ugly and pretty versions of URLs, and like Google, the SEO Spider fetches the ugly version of the URL and maps the pre-rendered HTML snapshot to the pretty URL. Some AJAX sites or pages may not use hash fragments, so the meta fragment tag can be used to signal an AJAX page to crawlers.
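
As a simplified illustration of the old scheme (the URLs below are hypothetical), the pretty hash-fragment URL maps to an ugly URL from which the HTML snapshot is served, and pages without a hash fragment can opt in via the meta fragment tag:

  Pretty URL (what users see):     https://example.com/#!products/blue-widget
  Ugly URL (what crawlers fetch):  https://example.com/?_escaped_fragment_=products/blue-widget

  <!-- On pages without a #! fragment, this tag tells crawlers to fetch -->
  <!-- https://example.com/page?_escaped_fragment_= for the HTML snapshot -->
  <meta name="fragment" content="!">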

If the site is built using JavaScript but doesn’t adhere to the old crawling scheme or pre-render pages, then you may find only the homepage is crawled with a 200 OK response and perhaps a couple of JavaScript and CSS files, but not much else.

crawling JS without rendering!

You’ll probably also find that the page has virtually no ‘outlinks’ in the tab at the bottom of the tool, as they are not being rendered and hence can’t be seen.

javascript rendering outlinks

In the example screen shot above, the ‘outlinks’ tab in the SEO Spider shows JS and CSS files on the page only.

Client Q&A

This should really be the first step. One of the simplest ways to find out about a website is to speak to the client and the development team and ask the question: what’s the site built in? What CMS is it using, or is it bespoke?

Pretty sensible questions and you might just get a useful answer.

Disable JavaScript

You can turn JavaScript off in your browser and view what content is available. This is possible in Chrome using the built-in developer tools, or if you use Firefox, the Web Developer toolbar plugin has the same functionality. Is content available with JavaScript turned off? You may just see a blank page.

JavaScript disabled

Typically it’s also useful to disable cookies and CSS during an audit, to diagnose other crawling issues that can be experienced.

Audit The Source Code

A simple one: right click and view the raw HTML source code. Is there actually much text and HTML content? Often there are signs and hints of the JS frameworks and libraries used. Are you able to see the content and hyperlinks rendered in your browser within the HTML source code?

You’re viewing the code before it’s processed by the browser, which is what the SEO Spider will crawl when not in JavaScript rendering mode.

If you run a search and can’t find the content or hyperlinks within the source, then they are being dynamically generated in the DOM and will only be viewable in the rendered code.

source code of a JS site

If the body is pretty much empty like the above example, it’s a pretty clear indication.
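
If you’d rather not search the view-source window by hand, a rough check from the browser console is to re-fetch the raw HTML of the current page and search it for a phrase you can see on the rendered page (a sketch only; the phrase is a placeholder):

  // Fetch the raw HTML of the current page (before any JavaScript runs) and
  // check whether a phrase visible in the browser actually exists within it.
  const phrase = 'Blue Widget'; // replace with text you can see on the page
  fetch(location.href)
    .then(response => response.text())
    .then(rawHtml => {
      console.log('In raw HTML source:', rawHtml.includes(phrase));
      console.log('In rendered page:  ', document.body.innerText.includes(phrase));
    });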

Audit The Rendered Code

How different is the rendered code to the static HTML source? By right clicking and using ‘inspect element’ in Chrome, you can view the rendered HTML.

Rendered HTML source code

You will find that the content and hyperlinks are in the rendered code, but not the original HTML source code. This is what the SEO Spider will see, when in JavaScript rendering mode.

By clicking on the opening HTML element, then ‘copy > outerHTML’, you can compare the rendered source code against the original source.
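
A rough way to quantify the difference from the browser console is to parse the raw HTML and compare it against the rendered DOM, for example by counting the hyperlinks in each (a sketch only, assuming the page can be re-fetched from the console):

  // Compare hyperlink counts in the raw HTML source vs the rendered DOM.
  fetch(location.href)
    .then(response => response.text())
    .then(rawHtml => {
      const rawDoc = new DOMParser().parseFromString(rawHtml, 'text/html');
      console.log('Links in raw HTML source:', rawDoc.querySelectorAll('a[href]').length);
      console.log('Links in rendered DOM:   ', document.querySelectorAll('a[href]').length);
    });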

Toolbars & Plugins

Various toolbars and plugins such as the BuiltWith toolbar, Wappalyzer and the JS Library Detector for Chrome can help identify the technologies and frameworks being utilised on a web page at a glance.

These are not always accurate, but can provide some valuable hints, without much work.
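
You can also check for common framework globals directly in the browser console. These are heuristics only; the exact globals vary by framework and version, and bundled builds may not expose them at all:

  // Quick heuristic checks for common JavaScript libraries and frameworks.
  // A missing global doesn't prove a framework isn't in use.
  console.log('jQuery:   ', window.jQuery ? window.jQuery.fn.jquery : false);
  console.log('AngularJS:', window.angular ? window.angular.version.full : false);
  console.log('React:    ', window.React ? window.React.version : false);
  console.log('Vue:      ', typeof window.Vue !== 'undefined');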

Manual Auditing Is Still Required

These points should help you identify sites that are built using a JS framework fairly easily. However, further analysis is always recommended to discover JavaScript elements: manually inspect page templates and audit different content areas and elements which might require user interaction.

We see lots of e-commerce websites relying on JavaScript to load products onto category pages, which is often missed by webmasters and SEOs until they realise product pages are not being crawled in standard (non-rendering) crawls.

Additionally, you can support a manual audit by crawling a selection of templates and pages from across the website, with JavaScript both disabled and enabled, and analysing any differences in elements and content. Sometimes websites use variables for elements like titles, meta tags or canonicals, which are extremely difficult to pick up by eye alone.
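
As a quick spot check for a single page, you can compare the title and canonical in the raw HTML against the rendered DOM from the browser console (a sketch only; in practice you’d crawl a selection of templates as described above):

  // Compare the title and canonical URL in the raw HTML source vs the rendered DOM.
  fetch(location.href)
    .then(response => response.text())
    .then(rawHtml => {
      const rawDoc = new DOMParser().parseFromString(rawHtml, 'text/html');
      const canonical = doc => {
        const link = doc.querySelector('link[rel="canonical"]');
        return link ? link.getAttribute('href') : '(none)';
      };
      console.log('Title (raw / rendered):    ', rawDoc.title, '/', document.title);
      console.log('Canonical (raw / rendered):', canonical(rawDoc), '/', canonical(document));
    });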

I recommend reading Justin Briggs’s guide to auditing JavaScript for SEO, which goes into far more practical detail about this analysis phase.

How To Crawl JavaScript Using The SEO Spider

Once you have identified the JavaScript you want to crawl, you’ll need to configure the SEO Spider to JavaScript rendering mode to be able to crawl it.

The following 7 steps should help you configure a crawl for most cases encountered.

1) Configure Rendering To ‘JavaScript’

To crawl a JavaScript website, open up the SEO Spider, click ‘Configuration > Spider > Rendering’ and change ‘Rendering’ to ‘JavaScript’.

crawl with JavaScript rendering

2) Check Resources & External Links

Ensure resources such as images, CSS and JS are ticked under ‘Configuration > Spider’.

If resources are on a different subdomain, or a separate root domain, then ‘check external links‘ should be ticked, otherwise they won’t be crawled and hence rendered either.

check resources

This is the default configuration in the SEO Spider, so you can simply click ‘File > Default Config > Clear Default Configuration’ to revert to this set-up.

3) Configure User-Agent & Window Size

You can configure both the user-agent (under ‘Configuration > HTTP Header > User-Agent’) and the window size (under ‘Configuration > Spider > Rendering’ in JavaScript rendering mode) to your own requirements.

This is an optional step; the window size is set to Googlebot’s desktop dimensions in the standard configuration. Google is expected to move to a mobile-first index soon, so if you’re performing a mobile audit you can configure the SEO Spider to mimic Googlebot for Smartphones.

rendered page screen shots
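
For reference, Googlebot identifies itself with user-agent strings along the following lines; check Google’s crawler documentation for the current values, as these change over time:

  Googlebot (desktop):
  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

  Googlebot for Smartphones (an Android Chrome user-agent with the Googlebot token appended):
  Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)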

4) Crawl The Website

Now type or paste in the website you wish to crawl in the ‘enter url to spider’ box and hit ‘Start’.

Crawl the JavaScript site

The crawling experience is quite different to a standard crawl, as it can take time for anything to appear in the UI to start with, then all of a sudden lots of URLs appear together at once. This is due to the SEO Spider waiting for all the resources to be fetched to render a page before the data is displayed.

5) Monitor Blocked Resources

Keep an eye on anything appearing under the ‘Blocked Resource’ filter within the ‘Response Codes’ tab. You can glance at the right-hand overview pane, rather than click on the tab specifically. If JavaScript, CSS or images are blocked via robots.txt, don’t respond, or return an error, then this will impact rendering, crawling and indexing.

Blocked Resource filter

Blocked resources can also be viewed for each page within the ‘Rendered Page’ tab, adjacent to the rendered screen shot in the lower window pane. In severe cases, if a JavaScript site blocks JS resources completely, then the site simply won’t crawl.

Blocked Resources JavaScript Crawling

If key resources which impact the render are blocked, then unblock them to crawl (or allow them using the custom robots.txt for the crawl). You can test different scenarios using both the exclude and custom robots.txt features.
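
For example, rendering problems are often caused by robots.txt rules along the following lines (the paths are hypothetical); adding more specific Allow rules for the resource directories resolves the issue for Googlebot, which follows the most specific matching rule:

  # Problematic: blocks the JavaScript and CSS needed to render pages
  User-agent: *
  Disallow: /assets/

  # Fixed: the more specific Allow rules let rendering resources be crawled
  User-agent: *
  Disallow: /assets/
  Allow: /assets/js/
  Allow: /assets/css/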

The pages this impacts and the individual blocked resources can also be exported in bulk via the ‘Bulk Export > Response Codes > Blocked Resource Inlinks’ report.

Blocked Resources

6) View Rendered Pages

You can view the rendered page the SEO Spider crawled in the ‘Rendered Page’ tab which dynamically appears at the bottom of the user interface when crawling in JavaScript rendering mode. This populates the lower window pane when selecting URLs in the top window.

Viewing the rendered page is vital when analysing what a modern search bot is able to see and is particularly useful when performing a review in staging, where you can’t use Google’s own Fetch & Render in Search Console.

If you have adjusted the user-agent and viewport to Googlebot Smartphone, you can see exactly how every page renders on mobile for example.

JavaScript rendered crawled page screen shot

If you spot any problems in the rendered page screen shots and it isn’t due to blocked resources, you may need to consider adjusting the AJAX timeout, or digging deeper into the rendered HTML source code for further analysis.

7) Adjust The AJAX Timeout

Based upon the responses of your crawl, you can choose when the snapshot of the rendered page is taken by adjusting the ‘AJAX timeout‘ which is set to 5 seconds, under ‘Configuration > Spider > Rendering’ in JavaScript rendering mode.

ajax timeout

Our internal testing has indicated that Googlebot takes its snapshot of the rendered page at 5 seconds, which many in the industry concurred with when we discussed it more publicly last year.

Google obviously won’t wait forever, so content that you want to be crawled and indexed needs to be available within that time, or it simply won’t be seen. We’ve seen cases of misfiring JS causing the render to load much later than 5 seconds, and entire websites plummeting in rankings due to pages suddenly being indexed and scored with virtually no content.
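
As a simplified illustration (the element and delay are hypothetical), a script like the following would inject content well after a render snapshot taken at around 5 seconds, so that content would never be seen or indexed:

  // Content injected 8 seconds after load arrives too late for a rendered
  // snapshot taken at around 5 seconds, so it won't be seen or indexed.
  setTimeout(() => {
    document.getElementById('product-list').innerHTML =
      '<a href="/products/blue-widget">Blue Widget</a>';
  }, 8000);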

It’s worth noting that a crawl by our software will often be more resource-intensive than a regular Google crawl over time. This might mean that site response times are typically slower, and the AJAX timeout requires adjustment.

You’ll know this might need to be adjusted if the site fails to crawl properly, ‘response times’ in the ‘Internal’ tab are longer than 5 seconds, or web pages don’t appear to have loaded and rendered correctly in the ‘rendered page’ tab.

Closing Thoughts

The guide above should help you identify JavaScript websites and crawl them efficiently using the Screaming Frog SEO Spider tool in JavaScript rendering mode.

While we have performed plenty of research internally and worked hard to mimic Google’s own rendering capabilities, a crawler is still only ever a simulation of real search engine bot behaviour. We highly recommend using log file analysis and Google’s own Fetch & Render tool alongside a JavaScript crawler, to fully understand what Google is able to crawl, render and index.

If you experience any problems when crawling JavaScript, or encounter any differences between how we render and crawl, and Google, we’d love to hear from you. Please get in touch with our support team directly.
