Introduction To Log Files
The purpose of this guide is to give an introduction to log files for SEO analysis. After reading this guide you’ll know what log files are, what they contain, and what they look like.
When we talk about analysing web server logs, we’re talking about the access logs. Web servers produce other log files, such as error logs, but for SEO analysis we need the access log. The access log records all access to a web site.
Typically there will be one line per request, and this will include some or all of the following:
- Time the request was made
- URL requested
- User agent supplied when making the request
- Response Code
- Size of response
- IP address of client making the request
- Time taken to serve the request
- Referrer, the page that provided the link to make this request
Having an accurate record of all requests a web server has received is incredibly useful, allowing for some powerful analysis. Log file analysis has some distinct advantages over crawls:
- You can view exactly which URLs have been requested by the search engines. A crawl is a simulation, whereas log files can show you exactly what has been crawled.
- Issues are quantified. If you find several 404s on your site during a crawl, it’s hard to tell which ones are the most important. By analysing the log files you’ll be able to see how often they’ve occurred, which will help you to prioritise fixing them.
- See orphaned URLs. If a URL is not linked to internally, it won’t be found by crawling the site. URLs that are linked to only externally (or historically) will be reported in logs.
- View changes over time. A crawl is a snapshot of now, whereas log files provide a historical record of every event.
- See what users actually experienced, rather than how a crawler saw your site.
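To illustrate the "issues are quantified" point, here is a minimal sketch of ranking 404s by how often they occurred. The `records` list and the `count_status` helper are hypothetical; they assume the log lines have already been parsed into (URL, response code) pairs.

```python
from collections import Counter

# Hypothetical, already-parsed records: (url, response code) pairs.
records = [
    ("/old-page/", 404),
    ("/about/", 200),
    ("/old-page/", 404),
    ("/missing.png", 404),
    ("/old-page/", 404),
]


def count_status(records, status):
    """Count how often each URL returned the given response code."""
    return Counter(url for url, code in records if code == status)


not_found = count_status(records, 404)

# List the 404s, most frequently requested first.
for url, hits in not_found.most_common():
    print(f"{hits:>4}  {url}")
```

The URLs at the top of the list are the ones requested most often, so they are usually the ones worth fixing first.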
Web servers record every access, so on a busy site the logs can soon become very large. Rather than losing data, log files are typically rotated, or archived. This may be triggered by the size of the log file, say every time it hits 10MB, or at fixed time intervals (daily, monthly, etc.). Because there is so much repetition in a log file (the same URLs and response codes over and over), they compress down really well.
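If you want to read rotated logs yourself, a sketch like the following can treat plain and compressed files alike. It assumes the archives are gzip-compressed and share a prefix such as `access.log`, `access.log.1.gz`; real rotation schemes vary, and the function names here are just illustrative.

```python
import gzip
from pathlib import Path


def read_log_lines(path):
    """Yield lines from a plain-text or gzip-compressed log file."""
    path = Path(path)
    # Assumption: compressed archives are recognisable by a .gz suffix.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")


def read_rotated_logs(directory, pattern="access.log*"):
    """Yield lines from every file matching the pattern, in file name order."""
    for path in sorted(Path(directory).glob(pattern)):
        yield from read_log_lines(path)
```

Note that file name order is not necessarily chronological order, so sort the records by their timestamp if the sequence matters.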
The Log File Analyser supports two log file formats: Apache, produced by Apache and Nginx web servers, and W3C, produced by Microsoft IIS. Together these represent about 95% of the web server market today, so if you have the original web server log files, it’s highly likely you’ll be able to import them into the Log File Analyser. Both log formats are plain text, so if you have a log file, it’s very easy to check that it’s actually an access log.
Here’s an example line from an Apache log file:
127.0.0.1 - - [01/Jan/2016:12:00:00 +0000] "GET http://www.example.com/about/ HTTP/1.1" 200 512 "http://www.example.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
Reading from left to right we can see:
- 127.0.0.1 made the HTTP request
- The request was on Jan 1st 2016 at 12pm
- The URL requested was http://www.example.com/about/
- The response code was 200
- The size of the response was 512 bytes
- The previous (referring) URL was http://www.example.com/
- The User Agent was Chrome on Mac OS X
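The same left-to-right reading can be sketched in code. This is a minimal example, assuming the line follows the layout shown above (Apache's "combined" format); the regular expression and the `parse_apache_line` name are illustrative, and servers configured with a custom log format will need a different pattern.

```python
import re
from datetime import datetime

# Assumed layout: ip ident user [time] "method url protocol" status size "referrer" "user agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_apache_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # Apache writes "-" when no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


line = (
    '127.0.0.1 - - [01/Jan/2016:12:00:00 +0000] '
    '"GET http://www.example.com/about/ HTTP/1.1" 200 512 '
    '"http://www.example.com/" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"'
)
record = parse_apache_line(line)
print(record["url"], record["status"], record["size"])
```

Returning `None` for lines that do not match makes it easy to skip anything that isn't an access log entry.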
If you’d like to learn more, check out An SEOs Guide To Apache Log Files.
Here’s the same request in W3C format:
#Software: Microsoft Internet Information Services 6.0
#Date: 2002-05-24 20:18:01
#Fields: date time c-ip cs-username cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs(User-Agent) cs(Referer)
2016-01-01 12:00:00 127.0.0.1 - GET /about/ - 200 512 Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_10_5)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/51.0.2704.103+Safari/537.36 http://www.example.com/
Here the file starts with some metadata detailing which fields will be present on each line and in what order. The line containing the actual record of the request holds the same information as the Apache line, but it is laid out differently. For example, the time of the access is recorded as two separate fields, date and time, rather than as one.
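Because the #Fields line names each column, a parser can read the header rather than hard-coding positions. Here is a minimal sketch under that assumption; `parse_w3c` is a hypothetical helper, and the sample uses a shortened field list for brevity.

```python
def parse_w3c(lines):
    """Yield one dict per request, keyed by the names in the #Fields line."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Metadata line: only #Fields changes how records are read.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split(" ")
        if len(values) != len(fields):
            continue  # skip lines that don't match the declared fields
        yield dict(zip(fields, values))


sample = [
    "#Software: Microsoft Internet Information Services 6.0",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status sc-bytes cs(User-Agent)",
    "2016-01-01 12:00:00 127.0.0.1 GET /about/ 200 512 Mozilla/5.0+(Macintosh;+Intel+Mac+OS+X+10_10_5)",
]
records = list(parse_w3c(sample))

# Spaces inside a value are written as "+", so swap them back for display.
user_agent = records[0]["cs(User-Agent)"].replace("+", " ")
print(records[0]["cs-uri-stem"], records[0]["sc-status"], user_agent)
```

Reading the header this way means the same code copes with servers that log a different set of fields, or the same fields in a different order.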
If you’d like to learn more, check out An SEOs Guide To W3C Log Files.
Amazon Elastic Load Balancing
Here’s an example request from an ELB log:
2017-01-01T09:00:00.00 my-elb 220.127.116.11:2817 10.0.0.1:80 0.000500 0.000500 0.000057 200 200 0 29 "GET https://www.example.com:443/contact.html HTTP/1.1" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" DHE-RSA-AES128-SHA TLSv1.2
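An ELB line is space-separated with the request and user agent wrapped in quotes, so it can be split much like a shell command. The sketch below assumes the 15-field layout of the example line above; the field names follow our reading of that layout, `parse_elb_line` is a hypothetical helper, and other load balancer types write additional fields.

```python
import shlex

# Assumed field order, matching the example line above.
ELB_FIELDS = [
    "timestamp", "elb", "client", "backend",
    "request_processing_time", "backend_processing_time",
    "response_processing_time", "elb_status_code", "backend_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
    "ssl_cipher", "ssl_protocol",
]


def parse_elb_line(line):
    """Return a dict of fields, or None if the field count is unexpected."""
    values = shlex.split(line)  # keeps each quoted section as one value
    if len(values) != len(ELB_FIELDS):
        return None
    record = dict(zip(ELB_FIELDS, values))
    parts = record["request"].split(" ", 2)
    if len(parts) == 3:
        record["method"], record["url"], record["protocol"] = parts
    record["elb_status_code"] = int(record["elb_status_code"])
    return record


line = (
    '2017-01-01T09:00:00.00 my-elb 220.127.116.11:2817 10.0.0.1:80 '
    '0.000500 0.000500 0.000057 200 200 0 29 '
    '"GET https://www.example.com:443/contact.html HTTP/1.1" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" '
    'DHE-RSA-AES128-SHA TLSv1.2'
)
record = parse_elb_line(line)
print(record["url"], record["elb_status_code"], record["user_agent"])
```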
If you’d like to learn more, check out An SEOs Guide To Amazon Elastic Load Balancing Log Files.
Hopefully this has given you an idea of what information can be found in log files, what they look like, and what kind of analysis you can do with the Log File Analyser.
If you need further inspiration on analysing log files, have a read of our 22 Ways To Analyse Logs Using The Log File Analyser guide.