However, Google in particular have evolved, deprecating their old AJAX crawling scheme guidelines (escaped-fragment #! URLs and HTML snapshots) in October 2015, and are now generally able to render and understand web pages like a modern-day browser.
We integrated the Chromium project library for our rendering engine to emulate Google as closely as possible.
Google have confirmed they use a web rendering service (WRS) based on Chrome 41 (M41). The SEO Spider uses a later version of Chrome, version 64 as of writing, but we recommend viewing the exact version within the SEO Spider by clicking ‘Help > Debug’ and scrolling down to the ‘Chrome Version’ line.
Hence, while rendering will obviously be similar, it won’t be exactly the same (there are some arguments that Chrome 41 itself won’t be exactly the same, either).
However, generally, the WRS supports the same web platform features and capabilities as the Chrome version it uses, and you can compare the differences between Chrome 41 and 64 at CanIUse.com.
- All the resources of a page (JS, CSS, imagery) need to be available to be crawled, rendered and indexed.
- Google don’t click around like a user or trigger additional events after the render (a click, a hover or a scroll, for example).
- The rendered page snapshot is taken at around 5 seconds, so the content of each web page needs to be loaded by that time, or it just won’t be indexed.
This means crawlers will be served a static HTML version of the web page for crawling and indexing. If you already have this set-up, then you can test this functionality by switching the user-agent to Googlebot within the SEO Spider.
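Outside the SEO Spider, the same user-agent check can be sketched with Python's standard library. This is a minimal illustration only: `build_request` and `fetch_html` are hypothetical helper names, and the user-agent string is Googlebot's published desktop token.

```python
from urllib.request import Request, urlopen

# Googlebot's published desktop user-agent string.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url, user_agent=None):
    """Build a GET request, optionally overriding the User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return Request(url, headers=headers)


def fetch_html(url, user_agent=None, timeout=10):
    """Fetch a URL and return the response body as text (performs a network call)."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Comparing `fetch_html(url)` with `fetch_html(url, GOOGLEBOT_UA)` for one of your own pages should show whether a different, static HTML version is returned to the crawler user-agent.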
Google also have a very useful progressive web app checklist, which covers some essential requirements for crawling and indexing of PWAs, such as using the history API instead of page fragment identifiers.
Google have a two-phase indexing process, whereby they initially crawl and index the static HTML, and then return later when resources are available to render the page and crawl and index content and links in the rendered HTML.
This means the crawling and indexing process is much slower, so if you rely on timely content (such as a publisher), a client-side approach is not a sensible option. It also means that elements in the original response (such as meta data and canonicals) can be used for the page, until Google gets around to rendering it when resources are available.
The AJAX tab shows both ugly and pretty versions of URLs, and like Google, the SEO Spider fetches the ugly version of the URL and maps the pre-rendered HTML snapshot to the pretty URL. Some AJAX sites or pages may not use hash fragments, so the meta fragment tag can be used to recognise an AJAX page for crawlers.
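The pretty-to-ugly mapping described above can be sketched as follows. This is a minimal illustration of the deprecated AJAX crawling scheme, not the SEO Spider's own code: `to_ugly_url` is a hypothetical name, and the percent-encoding is a simplification of the scheme's exact escaping rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ugly_url(pretty_url, has_meta_fragment=False):
    """Map a pretty AJAX URL to its _escaped_fragment_ (ugly) equivalent.

    - '#!state' becomes '?_escaped_fragment_=state'
    - a page without a hash fragment, but with the meta fragment tag,
      becomes '?_escaped_fragment_=' (empty value)
    """
    parts = urlsplit(pretty_url)
    if parts.fragment.startswith("!"):
        # Simplified escaping: keep '=' and '/' readable, encode the rest.
        token = quote(parts.fragment[1:], safe="=/")
    elif has_meta_fragment:
        token = ""
    else:
        return pretty_url  # Not an AJAX crawling scheme URL.
    separator = "&" if parts.query else ""
    query = f"{parts.query}{separator}_escaped_fragment_={token}"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(to_ugly_url("https://example.com/page#!key=value"))
print(to_ugly_url("https://example.com/", has_meta_fragment=True))
```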
You’ll probably also find that the page has virtually no ‘outlinks’ in the tab at the bottom of the tool, as they are not being rendered and hence can’t be seen.
In the example screen shot above, the ‘outlinks’ tab in the SEO Spider shows JS and CSS files on the page only.
This should really be the first step. One of the simplest ways to find out about a website is to speak to the client and the development team and ask the question. What’s the site built in? What CMS is it using, or is it bespoke?
Pretty sensible questions and you might just get a useful answer.
Typically it’s also useful to disable cookies and CSS during an audit, to help diagnose other crawling issues.
Audit The Source Code
A simple one: right click and view the raw HTML source code. Is there actually much text and HTML content? Often there are signs of, and hints to, the JS frameworks and libraries used. Are you able to see the content and hyperlinks rendered in your browser within the HTML source code?
If you run a search and can’t find them within the source, then they will be dynamically generated in the DOM and will only be viewable in the rendered code.
If the body is pretty much empty like the above example, it’s a pretty clear indication.
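The source code check can also be roughed out in a script. The sketch below uses Python's standard `html.parser` to count visible text, hyperlinks and script tags in the raw HTML; the `SourceAudit` and `audit_source` names, and the 200 character threshold, are arbitrary assumptions for illustration rather than an established rule.

```python
from html.parser import HTMLParser

MIN_TEXT_CHARS = 200  # Arbitrary threshold, assumed for illustration only.


class SourceAudit(HTMLParser):
    """Collect links, script tags and visible text length from raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.script_count = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
            if tag == "script":
                self.script_count += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def audit_source(raw_html):
    """Summarise the raw HTML and flag a likely client-side rendered page."""
    parser = SourceAudit()
    parser.feed(raw_html)
    parser.close()
    return {
        "links": parser.links,
        "script_count": parser.script_count,
        "text_chars": parser.text_chars,
        "looks_client_rendered": parser.script_count > 0
        and parser.text_chars < MIN_TEXT_CHARS,
    }


RAW_HTML = """<html><head><script src="/app.js"></script></head>
<body><div id="root"></div></body></html>"""
print(audit_source(RAW_HTML))
```

A near-empty body with scripts present is only a hint, so the flag is best treated as a prompt for the manual checks that follow.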
Audit The Rendered Code
How different is the rendered code to the static HTML source? By right clicking and using ‘inspect element’ in Chrome, you can view the rendered HTML. You can often see the JS Framework name in the rendered code, like ‘React’ in the example below.
By clicking on the opening HTML element, then ‘copy > outerHTML’, you can compare the rendered source code against the original source.
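Once both versions are saved, the comparison can be sketched with Python's standard `difflib`. This assumes both documents have been saved with similar line breaks (copied outerHTML is often minified, in which case it would need pretty-printing first); `rendered_only_lines` is a hypothetical helper name.

```python
import difflib


def rendered_only_lines(raw_html, rendered_html):
    """Return lines that appear in the rendered HTML but not in the raw source."""
    diff = difflib.ndiff(raw_html.splitlines(), rendered_html.splitlines())
    # ndiff prefixes lines unique to the second sequence with '+ '.
    return [line[2:] for line in diff if line.startswith("+ ")]


raw = '<body>\n<div id="root"></div>\n</body>'
rendered = '<body>\n<div id="root"><a href="/about">About</a></div>\n</body>'
for line in rendered_only_lines(raw, rendered):
    print(line)
```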
Toolbars & Plugins
These are not always accurate, but can provide some valuable hints, without much work.
Manual Auditing Is Still Required
The following 8 steps should help you configure a crawl for most cases encountered.
2) Check Resources & External Links
If resources are on a different subdomain, or a separate root domain, then ‘check external links’ should be ticked, otherwise they won’t be crawled, and hence won’t be rendered either.
This is the default configuration in the SEO Spider, so you can simply click ‘File > Default Config > Clear Default Configuration’ to revert to this set-up.
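To see whether a page depends on resources hosted elsewhere, a quick hostname comparison is often enough. The sketch below treats any resource on a different hostname (another subdomain or another root domain) as external; `external_resources` is a hypothetical helper, and the exact internal/external rules in the SEO Spider may differ.

```python
from urllib.parse import urljoin, urlsplit


def external_resources(page_url, resource_urls):
    """Return absolute URLs of resources hosted on a different hostname."""
    page_host = urlsplit(page_url).hostname
    external = []
    for reference in resource_urls:
        absolute = urljoin(page_url, reference)  # Resolve relative references.
        if urlsplit(absolute).hostname != page_host:
            external.append(absolute)
    return external


resources = [
    "/js/app.js",
    "https://cdn.example.com/lib.js",
    "//static.example.com/logo.png",
    "https://fonts.example.net/font.css",
]
print(external_resources("https://www.example.com/", resources))
```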
3) Configure User-Agent & Window Size
This is an optional step; the window size is set to Googlebot’s desktop dimensions in the standard configuration. Google are expected to move to a mobile-first index soon, hence if you’re performing a mobile audit you can configure the SEO Spider to mimic Googlebot for Smartphones.
4) Crawl The Website
Now type or paste in the website you wish to crawl in the ‘enter url to spider’ box and hit ‘Start’.
The crawling experience is quite different to a standard crawl, as it can take time for anything to appear in the UI to start with, and then lots of URLs appear all at once. This is due to the SEO Spider waiting for all the resources to be fetched to render a page before the data is displayed.
5) Monitor Blocked Resources
If key resources which impact the render are blocked, then unblock them to crawl (or allow them using the custom robots.txt for the crawl). You can test different scenarios using both the exclude and custom robots.txt features.
The pages this impacts and the individual blocked resources can also be exported in bulk via the ‘Bulk Export > Response Codes > Blocked Resource Inlinks’ report.
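A custom robots.txt scenario can also be tested offline before a crawl. The sketch below uses Python's standard `urllib.robotparser`; note that this parser does not implement Google's wildcard (`*`, `$`) or longest-match rules, so it is only an approximation, and `blocked_resources` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def blocked_resources(robots_txt, resource_urls, user_agent="Googlebot"):
    """Return the resource URLs that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in resource_urls if not parser.can_fetch(user_agent, url)]


ROBOTS_TXT = "User-agent: *\nDisallow: /js/\n"
RESOURCES = [
    "https://example.com/js/app.js",
    "https://example.com/css/site.css",
]
print(blocked_resources(ROBOTS_TXT, RESOURCES))
```

Running the same list against an edited robots.txt string shows which resources would be unblocked by the change.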
6) View Rendered Pages
Viewing the rendered page is vital when analysing what a modern search bot is able to see and is particularly useful when performing a review in staging, where you can’t use Google’s own Fetch & Render in Search Console.
If you have adjusted the user-agent and viewport to Googlebot Smartphone, you can see exactly how every page renders on mobile, for example.
If you spot any problems in the rendered page screen shots and they aren’t due to blocked resources, you may need to consider adjusting the AJAX timeout, or digging deeper into the rendered HTML source code for further analysis.
7) Compare Raw & Rendered HTML
This then populates the lower window ‘view source’ pane, to enable you to compare the differences, and be confident that critical content or links are present within the DOM.
8) Adjust The AJAX Timeout
Our internal testing has indicated that Googlebot takes their snapshot of the rendered page at 5 seconds, which many in the industry concurred with when we discussed it more publicly last year.
Our tests indicate Googlebot is willing to wait (approx) 5 secs for their snapshot of rendered content btw. Needs to be in well before then.
— Screaming Frog (@screamingfrog) October 13, 2016
Google obviously won’t wait forever, so content that you want to be crawled and indexed needs to be available within that time, or it simply won’t be seen. We’ve seen cases of misfiring JS causing the render to load much later than 5 seconds, and entire websites plummeting in rankings due to pages suddenly being indexed and scored with virtually no content.
It’s worth noting that a crawl by our software will often be more resource intensive than a regular Google crawl over time. This might mean that the site response times are typically slower, and the AJAX timeout requires adjustment.
You’ll know this might need to be adjusted if the site fails to crawl properly, ‘response times’ in the ‘Internal’ tab are longer than 5 seconds, or web pages don’t appear to have loaded and rendered correctly in the ‘rendered page’ tab.
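The response time check above can be applied to exported crawl data in a few lines. This is a sketch under stated assumptions: the rows are hypothetical (URL, response time in seconds) pairs, not the SEO Spider's actual export format, and `slow_pages` is an illustrative name.

```python
AJAX_TIMEOUT_SECONDS = 5.0  # The 5 second snapshot window discussed above.


def slow_pages(rows, timeout=AJAX_TIMEOUT_SECONDS):
    """Return URLs whose response time exceeds the timeout."""
    return [url for url, seconds in rows if seconds > timeout]


# Hypothetical sample data for illustration.
crawl_rows = [
    ("https://example.com/", 0.8),
    ("https://example.com/app", 6.2),
    ("https://example.com/blog", 4.9),
]
print(slow_pages(crawl_rows))
```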
While we have performed plenty of research internally and worked hard to mimic Google’s own rendering capabilities, a crawler is still only ever a simulation of real search engine bot behaviour.