What I Learnt from Analysing 7 Million Log File Events
What I Learnt from Analysing 7 Million Log File Events
Table of Contents
- What You’ll Learn From My Research
- How Often Is Googlebot Crawling the robots.txt File?
- Is There a Way to See in Log Files Where Googlebot Is Coming From?
- How Many Different IP Addresses Does Googlebot Have?
- Which Pages Are Googlebot Crawling More Often?
- Are 404 Status Code Pages Crawled More Than Three Times?
- How Long Does Googlebot Crawl a 301-Redirect For?
- How Often Does Googlebot Crawl Paginated Pages?
- Should Higher Traffic Pages Be Crawled More by Googlebot?
- Is It Important to Have a Lot of Products in a Category to Get More Attention From Googlebot?
- How Important Is Page Speed in the Context of Crawling Events?
- Does Rebooting Your Machine Make Your Log File Analysis Faster?
- Last Words
It’s been almost three years since I started to work intensely on log file analysis for SEO purposes, and to make the analysis simple, I decided on the Screaming Frog Log File Analyser as my tool of choice.
Because I faced so many cases where I should have known better, I decided to start noting down everything I learnt in the last ten months when working on a project. This project involved looking at server log files and then optimising the crawl budget based on findings.
This blog post is not about an SEO case study, but more of a research and learnings piece based on my experience of over 7.2 million user agent events.
The only user agent I tracked during this research is Googlebot’s, which had 124 different user agents and several thousand real IP addresses.
The website I was analysing is an eCommerce site, which has over 100,000 indexed pages, and by delving into and understanding the website’s log files, I found a lot of opportunities to optimise the crawl budget and positively impact the revenue of the brand.
If you’re not already familiar with server log files, the below resources are a great starting point to gain a better understanding of how they work, and how they can be leveraged for SEO insight.
- What Is A Server Log File? by Ian Lurie
- The Ultimate Guide to Log File Analysis for SEO by Daniel Butler
You can also find lots of additional information in Screaming Frog’s Log File Analyser user guide.
What You’ll Learn From My Research
Some of the insights I gained from this task include:
- How often is Googlebot crawling the robots.txt file?
- Is there a way to see in the log files where Googlebot is coming from?
- How many different IP addresses does Googlebot have?
- Which Pages Are Googlebot Crawling More Often?
- Are 404 status code pages crawled more than three times?
- How many times does Googlebot crawl a 301-redirect?
- How often does Googlebot crawl paginated pages?
- Should higher traffic pages be crawled more by Googlebot?
- Is it important to have a lot of products in a category to get more attention from Googlebot?
- How important is page speed in the context of crawling events?
- Does rebooting your machine make your log file analysis faster?
How Often Is Googlebot Crawling the robots.txt File?
This is one of the most challenging parts of log file analysis, as there’s no exact number of times that Googlebot will crawl your robots.txt. But in this website’s case, Googlebot crawled the robots.txt file between 6 to 60 times per day. In 10 months, Googlebot crawled the robots.txt file over 24 thousand times – which breaks down as around 2,400 times per month.
The most common IP addresses to crawl the robots.txt file were:
- 18.104.22.168 (ISP: Google – United States)
- 22.214.171.124 (ISP: Google – United States)
It’s important to remember that this website has almost 100,000 indexed pages. In some cases, such as if the website is small, the robots.txt will be crawled at a much lower rate.
Don’t forget that Googlebot uses robots.txt to know which pages should or shouldn’t be crawled, so it’s important to ensure it’s reachable to Googlebot.
Is There a Way to See in Log Files Where Googlebot Is Coming From?
The short answer is yes.
Within log files you can see the referrer, which tells us that Googlebot found our site while crawling another.
You won’t see every backlink in this list, but there’s a chance you can catch some weird things that other tools may miss, such as your website’s IP address being indexable. In our case, the website had eight indexable IP addresses. If the original website has 100K indexed pages, then it makes 800K duplicate pages because of these IP-based domains.
As SEOs, we must inform developers about instances like this, and the developers will usually prevent those IP addresses from being crawlable.
Not every website will have the same problem, but it’s still something you must be aware of.
How Many Different IP Addresses Does Googlebot Have?
If we consider that there might be some bots that can impersonate even Googlebot, it might be good to start with how many real user agents of Googlebot I had to face in this period of 10 months.
There’s about 124 different user agents with a total of 7,231,982 events from Googlebot. This year, Google published their IP addresses of Googlebot as a way to identify if it’s legitimately them or not. Over the period of 10 months of log file analysis, I faced thousands of different IP addresses.
What can you do with these thousands of IP addresses?
- 1. Use them to verify if the Googlebot’s IP is real or not
- 2. You can check which country Googlebot is crawling your website from
- 3. It becomes easier to filter server-side by IP address to find Googlebot verified logs
All the verifiable IP addresses start with “66.249…”, and if you want to see the IPs I exported from 2 user agents of Googlebot, here’s the list of them.
If you use AWS and want to verify Googlebot, you can use a filter and match those IP addresses. Usually, it’s not an SEO job; you can ask the developer.
The Screaming Frog Log File Analyser also allows you to verify all major search engine bots right within the tool:
Which Pages Are Googlebot Crawling More Often?
It’s no surprise that, yes, in our case the homepage was the most crawled page, as it should be.
But you might be surprised to hear the rest of the most crawled pages:
- 1. Image* – which is an icon that is used on almost every page.
- 2. Robots.txt file
- 3. Header Image – the brand used it on mobile devices at the top of the homepage.
- 4. Error page – if your website automatically redirects a 404 page to the same “Error Page”, I suggest you solve it. In these instances, finding which page is 404’ing will be very difficult, but we solved this problem by leaving 404 pages on the same URL that was removed.
*These images are uploaded under the root domain. The brand uses a CDN, so the images that have particular URL addresses are not included in these log files.
Are 404 Status Code Pages Crawled More Than Three Times?
As far as I see, 404 pages can be crawled unlimited times.
Why might this be the case?
- 1. There are still websites that refer to that old page – the best practice here is to contact the website owner and let them know about the replaced page.
- 2. It’s internally linked and hasn’t yet been replaced – even if you redirected that old page, you must remove the links to it from every page.
- 3. It’s included in a static XML sitemap – ideally a sitemap is updated dynamically so that pages are automatically added and removed.
- 4. Div style background URLs won’t update themselves – it’s best to check the source code to ensure there are no broken images.
How Long Does Googlebot Crawl a 301-Redirect For?
For this question, in my experience, hundreds of 301-redirected pages were crawled for over nine months.
One of the reasons is that those old pages (which now are redirected) are internally linked. Some of them have fewer crawl events because Googlebot started to understand which one is the new page.
Some of them had external links that couldn’t be updated, so there was little to do but wait for Google to recognise the new links.
I had some other cases where Googlebot crawled dozens of pages that were over 3 years old and returned a 404-response code. This is not unusual for large sites, and it’s not necessary to take any action. You won’t lose any value from the crawl budget for your website, and it’s not necessary to 301-redirect every single 404 page. Just make sure that the old page wasn’t bringing any value for users at the time.
How Often Does Googlebot Crawl Paginated Pages?
It’s almost impossible to say an exact number for this question, but on average when looking at events for the second page, I can say that it’s between 1 week to 1 month. Googlebot usually uses the pagination series for understanding the relationship between categories.
One thing to remember is, your website will never appear in the top ten of the search results with the second page of a category. But in any case, for the sake of “best practice”, you can keep indexable pagination pages or whatever structure you use for levelling the depth of categories.
Should Higher Traffic Pages Be Crawled More by Googlebot?
The short answer to this question is yes and no.
One of my learnings from this project’s log file analysis is that it doesn’t matter how many clicks a page gets. It’s more important to ensure that those pages are reachable, fast, and provide the intent that users are looking for.
So why is the answer also yes, then? Because there’s a homepage that has the most clicks and events. The homepage should have the most events, but not necessarily the most clicks. So, in this case, this is a typical result.
The most crawled pages by user clicks in the last ten months:
- 1. Homepage (Level 1) – over 60,000 events and 2,640,000 clicks.
- 2. Main Category (linked on every page – Level 2) – 4,260 events and over 8,200 clicks.
- 3. Brand Category (linked in faceted navigation of a sub-category – Level 4) – 3,400 events and over 5,000 clicks.
- 4. Sub-Category (linked in the main category – Level 3) – over 3,300 events and 4,300 clicks.
- 5. Chocolate Product (linked in the sub-category and brand category – Level 5) – over 3,200 events and 2,400 clicks.
The most clicked top 5 pages and how many events they have:
- 1. Homepage
- 2. Catalogue (linked on every page – Level 2) – 612 events and over 479,000 clicks.
- 3. Seasonal Sub-Category (linked in a Level 3 sub-category – Level 4) – 415 events and over 129,000 clicks.
- 4. Sub-Category (linked in the main category – Level 3) – 478 events and over 119,000 clicks.
- 5. Deals & Sales (linked on every page – Level 2) – over 1,600 events and 79,000 clicks.
Level – A definition of how many steps away from the homepage, referred to within the Screaming Frog Spider Tool as Crawl Depth.
Based on my findings, I feel that it’s not a “crawling signal” if a page getting a lot of clicks. As you can see above, the most crawled versus the most clicked pages are different from each other (excluding the homepage).
Is It Important to Have a Lot of Products in a Category to Get More Attention From Googlebot?
No. Even if they have 0 products in some of the categories, Googlebot will crawl it as much as other pages. In this case, I saw a lot of categories that had a meagre amount of products, but they were crawled periodically. It might be because of reasons such as the following:
- 1. Since those categories don’t return a 304 status code (not modified), Googlebot is trying to see if there’s any update in the product listing.
- 2. It’s included in the XML sitemap, and Googlebot loves to crawl this file very often. Some of the URLs which are included in the XML sitemap are not linked internally because the brand usually hides the categories that have no products in them.
- 3. These categories are linked on every page due to the header or footer menu. I would suggest everyone checks all links that are included on every page of the website.
How Important Is Page Speed in the Context of Crawling Events?
I asked my colleague Adem for his opinion on how important page speed is as metric in the context of crawling events, since he is experienced when it comes page speed optimisation. He said that the Largest Contentful Paint (LCP) is the one we must care about more than other Core Web Vitals metrics.
Well, I believe him. Not because he’s my colleague, but because his approach on page speed is intense. In his experience, almost all the page speed metrics affect themselves. If the Time to First Byte (TTFB) metric is bad, then the First Contentful Paint (FCP) is too. If the FCP metric is bad, optimising any parts of the website will greatly affect the Largest Contentful Paint metric. That’s why the LCP metric is the most challenging one, and that’s why we still don’t have the FCP metric in Google Search Console.
I compared how LCP affects Googlebot’s crawl events, and here’s what I learned:
- Googlebot will crawl those pages anyway, if it has a low score of LCP or not.
- The most crawled five pages have between 4 and 19 seconds of LCP load.
- In some cases, Googlebot tried to crawl the pages with a lousy LCP score more often.
- Pages with better LCP scores have more user clicks than others with bad scores.
- If a page is interactive for the users, it doesn’t matter what the metrics say. Users are the ones who choose where to spend time.
Does Rebooting Your Machine Make Your Log File Analysis Faster?
Yes. Do it often. Your CPU will thank you whenever you do log file analysis.
Finally, it would be a crime if I didn’t mention the references I used heavily during this project: