Validation

Posted 5 December, 2022 by Dan Sharp in Validation

The validation tab performs some basic best practice validations for issues that can affect crawlers when crawling and indexing. This isn’t W3C HTML validation, which is a little too strict. The aim of this tab is to identify issues that can prevent search bots from parsing and understanding a page reliably.

Columns

This tab includes the following columns.

Address – The URL address.
Content – The content type of the URL.
Status Code – The HTTP response code.
Status – The HTTP header response.
Indexability – Whether the URL is Indexable or Non-Indexable.
Indexability Status – The reason why a URL is Non-Indexable. For example, if it’s canonicalised to another URL.

Filters

This tab includes the following filters.

Invalid HTML Elements In <head> – Pages with invalid HTML elements within the <head>. When an invalid element is used in the <head>, Google assumes the end of the <head> element and ignores any elements that appear after the invalid element. This means critical <head> elements that appear after the invalid element will not be seen. The <head> element as per the HTML standard is reserved for title, meta, link, script, style, base, noscript and template elements only.
<body> Element Preceding <html> – Pages that have a body element preceding the opening html element. Browsers and Googlebot will automatically assume the start of the body and generate an empty head element before it. This means the intended head element below and its metadata will be seen in the body and ignored.
<head> Not First In <html> Element – Pages with an HTML element that precedes the <head> element in the HTML. The <head> should be the first element in the <html> element. Browsers and Googlebot will automatically generate a <head> element if it’s not first in the HTML. While ideally metadata elements would be in the <head>, if an element that is valid in the <head> is first in the <html>, it will be considered part of the generated <head>. However, if non-<head> elements such as <p>, <body>, <img> etc. are used before the intended <head> element and its metadata, then Google assumes the end of the <head> element. This means the intended <head> element and its metadata may only be seen in the <body> and ignored.
Missing <head> Tag – Pages missing a <head> element within the HTML. The <head> element is a container for metadata about the page, placed between the <html> and <body> tags. Metadata is used to define the page title, character set, styles, scripts, viewport and other data that are critical to the page. Browsers and Googlebot will automatically generate a <head> element if it’s omitted in the markup. However, it may not contain meaningful metadata for the page, and this should not be relied upon.
Multiple <head> Tags – Pages with multiple <head> elements in the HTML. There should only be one <head> element in the HTML, which contains all critical metadata for the document. Browsers and Googlebot will combine metadata from subsequent <head> elements if they all appear before the <body>. However, this should not be relied upon and is open to potential mix-ups. Any <head> tags after the <body> starts will be ignored.
Missing <body> Tag – Pages missing a <body> element within the HTML. The <body> element contains all the content of a page, including links, headings, paragraphs, images and more. There should be one <body> element in the HTML of the page. Browsers and Googlebot will automatically generate a <body> element if it’s omitted in the markup, however, this should not be relied upon.
Multiple <body> Tags – Pages with multiple <body> elements in the HTML. There should only be one <body> element in the HTML which contains all content for the document. Browsers and Googlebot will try to combine content from subsequent <body> elements, however, this should not be relied upon and is open to potential mix-ups.
HTML Document Over 15MB – Pages which are over 15MB in document size. This is important as Googlebot limits its crawling and indexing to the first 15MB of an HTML file or supported text-based file. This size does not include resources referenced in the HTML such as images, videos, CSS, and JavaScript that are fetched separately. Google only considers the first 15MB of the file for indexing and stops crawling afterwards. The file size limit is applied to the uncompressed data. The median size of an HTML file is about 30 kilobytes (KB), so pages are highly unlikely to reach this limit.
Resource Over 15MB – JavaScript and CSS files which are over 15MB in size. This is important as Googlebot limits its crawling and indexing to the first 15MB of a file, so anything beyond this limit in JavaScript or CSS might not be processed. The file size limit is applied to the uncompressed data.
High Carbon Rating – Pages that have a carbon rating of F using the digital carbon ratings system from Sustainable Web Design. This scale equates page weight tracked by the HTTP Archive with CO2 estimates per page view. The CO2 calculation uses ‘The Sustainable Web Design Model’ for calculating emissions, which considers datacentres, network transfer and device usage in calculations.
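
As the 15MB limits above apply to the uncompressed data, a size check needs to decompress the response body before measuring it. The following is a minimal Python sketch of that idea, not how the SEO Spider itself performs the check. The function names are hypothetical, only gzip encoding is handled, and reading "15MB" as 15 * 1024 * 1024 bytes is an assumption.

```python
import gzip

# Assumption: "15MB" is taken to mean 15 * 1024 * 1024 bytes.
LIMIT_BYTES = 15 * 1024 * 1024


def uncompressed_size(body: bytes, content_encoding: str = "") -> int:
    """Return the size of a response body after undoing gzip compression."""
    if content_encoding.lower() == "gzip":
        body = gzip.decompress(body)
    return len(body)


def over_limit(body: bytes, content_encoding: str = "", limit: int = LIMIT_BYTES) -> bool:
    """True if the uncompressed body is larger than the limit."""
    return uncompressed_size(body, content_encoding) > limit
```

A file that is small on the wire can still exceed the limit once decompressed, which is why the sketch measures the body after decompression rather than trusting the transferred size.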

For more on Invalid HTML Elements In <head>, please read our guide on ‘How To Debug Invalid HTML Elements In The Head’.
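
To illustrate the Invalid HTML Elements In <head> check, the rough Python sketch below scans raw HTML for start tags inside the <head> that are not in the allowed list given above. It is an approximation using the standard library tokenizer, not a full HTML parser and not the SEO Spider's own implementation. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Elements permitted inside <head>, as listed in the filter description above.
ALLOWED_IN_HEAD = {"title", "meta", "link", "script", "style",
                   "base", "noscript", "template"}


class HeadChecker(HTMLParser):
    """Collects start tags found inside <head> that are not allowed there."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.template_depth = 0  # content inside <template> is not checked
        self.invalid = []        # (tag, (line, column)) per offending tag

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif self.in_head:
            if tag == "template":
                self.template_depth += 1
            elif self.template_depth == 0 and tag not in ALLOWED_IN_HEAD:
                self.invalid.append((tag, self.getpos()))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "template" and self.template_depth > 0:
            self.template_depth -= 1


def invalid_head_elements(html: str) -> list:
    """Return the names of invalid elements found in the <head>, in order."""
    checker = HeadChecker()
    checker.feed(html)
    checker.close()
    return [tag for tag, _ in checker.invalid]
```

For example, calling invalid_head_elements on a page with an <img> placed between the <title> and a canonical <link> reports the <img>. Under the behaviour described above, the canonical after it is the kind of element that would not be seen.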

Dan Sharp

Dan Sharp is founder & Director of Screaming Frog. He has developed search strategies for a variety of clients from international brands to small and medium-sized businesses and designed and managed the build of the innovative SEO Spider software.
