Log File Analyser
How to Monitor AI Bots in the Log File Analyser
Introduction
Log file analysis is often overlooked when it comes to campaign strategy, and is generally considered to be on the more advanced side of technical SEO. However, log files are an absolute gold mine of information and data, especially in an age where we’re seeing an explosion of LLMs and their AI bots.
While it may look intimidating, log file analysis isn’t as hard as it seems. The Screaming Frog Log File Analyser helps you to easily import your logs, verify search engine and AI bots, and analyse their behaviour.
What Are Log Files?
Before we dive into tracking AI bots, let’s quickly cover the basics. There are several types of log files that servers generate, including error logs, security logs, and access logs. For monitoring AI bot activity, we’re interested in access logs.
Access logs record every single request made to your website. They capture who visited, what they requested, when they requested it and how the server responded.
While tools like Google Analytics show you curated, actionable insights around visitors and their behaviour, log files capture absolutely everything in its rawest form that analytics platforms often filter out (or don’t have access to).
Each line in an access log represents a single request, containing information like:
- The IP address of the visitor
- The timestamp of the request
- The requested URL
- The HTTP status code that the server returned
- The user agent string (this is important, as it identifies the browser or bot)
- The referrer URL
While the raw data may look overwhelming, it’s incredibly valuable. You can see exactly how users, search engines and bots are engaging with your site at a very granular level, allowing you to carry out many different types of analysis.
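To make this concrete, here’s a minimal sketch of how a single request breaks down into those fields. It assumes the common Apache/Nginx ‘combined’ log format with illustrative example values (your server’s format and the exact user-agent strings may differ), and it isn’t something you need to do yourself, as the Log File Analyser parses this for you on import.

```python
import re

# An illustrative access log line in the Apache/Nginx "combined" format.
# The IP, URL and user-agent are example values only.
line = (
    '203.0.113.7 - - [12/Mar/2025:10:15:32 +0000] '
    '"GET /blog/seo-guide/ HTTP/1.1" 200 51234 '
    '"-" "GPTBot/1.0 (+https://openai.com/gptbot)"'
)

# Combined log format: IP, identity, user, timestamp, request line,
# status code, bytes sent, referrer and user-agent.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

match = COMBINED.match(line)
if match:
    event = match.groupdict()
    print(event["ip"], event["time"], event["url"],
          event["status"], event["user_agent"])
```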
How Can Log Files Be Used for Tracking AI Bots?
Traditionally, log files are used to analyse search engine bot behaviour, looking at things such as response codes, crawl frequency, crawl budgets, and more, but they also give you visibility into AI bot activity that you can’t find anywhere else.
Most AI chat and LLM platforms, such as Perplexity and OpenAI, have different bots for different purposes. The Screaming Frog Log File Analyser currently includes presets for most major platforms, covering bots such as GPTBot, OAI-SearchBot and ChatGPT-User from OpenAI, ClaudeBot from Anthropic, and Perplexity-User from Perplexity.
As well as this, you’re also able to add your own custom user-agents.
This level of bot granularity unlocks even more insights, allowing you to identify what content is being used for model training, search/indexing or being surfaced in real-time to users within a chat. Alongside things like AI visibility tracking tools (which do have their caveats), this information can be used to build up a much clearer picture of your website’s visibility within AI.
Take a look at our introduction to log files for more information, as well as our guide to requesting logs from a server administrator (often jokingly referred to as the hardest part of log file analysis).
1) Importing Your Logs into the Log File Analyser
Creating a new project and importing your logs is as simple as dragging and dropping them into the Log File Analyser.
When you do, you’ll be asked to create a new project.
You can click the ‘User Agents’ tab to configure the user-agents analysed in the project. By default, the Log File Analyser only analyses search engine bot events, so the ‘Filter User Agents (Improves Performance)’ box is ticked.
For this tutorial, we’re going to select the AI User-agents and use the arrow to move them across to the ‘Selected’ pane.
You’re also able to tick ‘Verify Bots When Importing Logs (Slows down Import)’.
Search engine bots are often spoofed, and this performs a lookup against publicly confirmed IP lists (for example, Google’s and Bing’s) to confirm they are genuine. This verification currently applies to search engine bots only; AI bot verification will be added in a future update, though spoofing of AI crawlers is currently uncommon.
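To illustrate the general idea (this isn’t the Log File Analyser’s exact implementation), here’s a rough sketch of how a claimed Googlebot IP can be verified with a reverse DNS lookup followed by a forward lookup, which is the method Google documents for confirming genuine Googlebot requests:

```python
import socket

def verify_googlebot(ip: str) -> bool:
    """Rough Googlebot check: reverse DNS, then forward DNS confirmation."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # e.g. crawl-66-249-66-1.googlebot.com
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # The name must resolve back to the original IP to rule out spoofing.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

# Example usage with a documentation-range IP; this one returns False.
print(verify_googlebot("203.0.113.7"))
```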
2) Analysing the Data
Once the import is complete, you’ll see the Overview tab. This provides a summary of the imported log file data based on the chosen time period and the user-agent(s) selected.
The Overview shows key metrics like total events, unique URLs crawled, average bytes per request, and response time data. The charts visualise crawl activity over time, letting you quickly spot patterns in how AI bots are interacting with your site.
You can filter by specific bots and change the dates using the dropdowns in the top right. This is particularly useful for comparing how different AI bots behave, for instance, whether GPTBot (training) crawls differently to OAI-SearchBot (indexing).
Understanding the Tabs
The Log File Analyser organises data across several tabs, each providing a different view of bot activity:
URLs Tab – Shows every unique URL discovered in your logs, with metrics like request frequency, response codes, bytes transferred, and server response times per URL.
Response Codes Tab – Breaks down HTTP status codes (2XX, 3XX, 4XX, 5XX) for each URL, making it easy to identify where bots are encountering issues. The ‘Inconsistent’ column flags URLs returning different response codes across multiple requests.
User Agents Tab – Aggregates data by bot type, showing total requests, unique URLs accessed, error rates, and response code distributions for each crawler. This is particularly useful for comparing behaviour across different AI platforms.
Referers Tab – Shows where traffic originated from based on the HTTP Referer header. Most AI bot requests won’t include referer information, as they typically access sites directly rather than following links.
Directories Tab – Aggregates data by directory path, revealing which sections of your site structure AI bots are exploring.
IPs Tab – Shows activity by IP address, useful for investigating suspicious traffic or verifying bot authenticity.
Countries Tab – Displays the geographic distribution of bot traffic based on IP addresses.
Events Tab – The raw log data showing every individual request with all available attributes.
3) What to Look For
Now that you’ve imported your logs and familiarised yourself with the data, there are some key things to look for when monitoring AI bot activity.
Response Codes and Errors
A good starting point is double-checking that AI bots are successfully accessing your content. Filter by specific user-agents and look at their response code distribution in the User Agents tab.
High numbers of 4XX or 5XX errors indicate problems. For citation bots like ChatGPT-User or Perplexity-User, every error response is a missed opportunity: these bots fetch your content in real time to cite it in a user’s conversation, and if they can’t access it, your site won’t appear in the response.
Training bots (GPTBot, ClaudeBot) encountering errors means your content isn’t making it into their models. While you might intentionally block these via robots.txt, accidental blocks through server misconfigurations or overly aggressive rules are worth double-checking.
Lastly, check the Response Codes tab for inconsistent responses. If the same URL returns a 200 one day and a 404 the next, investigate why (e.g. LLMs hallucinating URLs, or old URLs that haven’t been correctly redirected).
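You can read these figures straight off the User Agents and Response Codes tabs, but if you ever want to sanity-check them against the raw logs, here’s a rough sketch. It assumes the combined log format, an ‘access.log’ file in the working directory and a simple substring match on the bot names used as examples above, so adjust it to your own setup:

```python
import re
from collections import Counter, defaultdict

# Bot name substrings to look for in the user-agent field (adjust as needed).
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Perplexity"]

# Minimal combined-log-format parser (URL, status and user-agent only).
LINE = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<url>\S+)[^"]*" '
                  r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')

status_by_bot = Counter()           # (bot, status) -> request count
statuses_by_url = defaultdict(set)  # url -> distinct statuses seen

with open("access.log") as f:
    for raw in f:
        m = LINE.match(raw)
        if not m:
            continue
        bot = next((b for b in AI_BOTS if b in m["ua"]), None)
        if bot is None:
            continue
        status_by_bot[(bot, m["status"])] += 1
        statuses_by_url[m["url"]].add(m["status"])

# Response code distribution per bot.
for (bot, status), count in sorted(status_by_bot.items()):
    print(f"{bot}: {status} x {count}")

# URLs returning different codes across requests (the 'Inconsistent' flag).
for url, statuses in statuses_by_url.items():
    if len(statuses) > 1:
        print("Inconsistent:", url, sorted(statuses))
```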
Most Visited URLs
Sort the URLs tab by ‘Num Events’ (total requests per URL) to see which pages AI bots are accessing most frequently. This reveals what content they consider valuable.
You might find:
- Blog posts and guides getting heavy traffic from training bots
- Product pages being prioritised by citation bots
- Entire sections of your site being ignored
Compare this against what you’d expect based on your site structure and content strategy. If your most important content isn’t being crawled, you may need to improve internal linking or review your robots.txt rules.
Different bot types often show different preferences. Training bots might favour long-form informational or educational content, whilst citation bots hit specific pages triggered by user queries, which could be product and commercial pages. Use the user-agent filter to compare behaviour across different platforms.
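As a rough way to make this comparison outside the tool, the sketch below counts the most requested URLs for a training bot and a citation bot side by side. The combined log format, the ‘access.log’ filename and the choice of bots are all assumptions to adapt:

```python
import re
from collections import Counter

LINE = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<url>\S+)[^"]*" '
                  r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')

# One counter per bot: GPTBot (training) vs ChatGPT-User (citations).
hits = {"GPTBot": Counter(), "ChatGPT-User": Counter()}

with open("access.log") as f:
    for raw in f:
        m = LINE.match(raw)
        if not m:
            continue
        for bot, counter in hits.items():
            if bot in m["ua"]:
                counter[m["url"]] += 1

# Top 10 URLs per bot: do training and citation bots want the same pages?
for bot, counter in hits.items():
    print(f"\n{bot}:")
    for url, count in counter.most_common(10):
        print(f"  {count:>5}  {url}")
```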
Bandwidth and Carbon Impact
The Total Bytes column shows how much data each AI bot is consuming. For high-traffic sites, aggressive AI crawlers could have a noticeable impact on bandwidth use.
The Log File Analyser automatically calculates carbon footprint using the CO2.js library, showing emissions per URL in the Total CO2 (mg) and Average CO2 (mg) columns. The calculation uses “The Sustainable Web Design Model”, which considers data centres, network transfer, and device usage.
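The tool does this calculation for you, but as a rough illustration of the idea, the sketch below converts bytes transferred into an emissions estimate. The coefficients are simplified placeholders rather than CO2.js’s actual values; the real Sustainable Web Design Model splits energy across data centre, network and device segments, and its published figures are updated over time.

```python
# Back-of-the-envelope CO2 estimate from bytes transferred.
# These coefficients are illustrative placeholders only; the Log File
# Analyser's figures come from CO2.js and the Sustainable Web Design Model.
KWH_PER_GB = 0.81          # assumed energy intensity per GB transferred
GRID_G_CO2_PER_KWH = 442   # assumed average grid carbon intensity

def estimate_co2_grams(total_bytes: int) -> float:
    gigabytes = total_bytes / 1_000_000_000
    return gigabytes * KWH_PER_GB * GRID_G_CO2_PER_KWH

# Example: an AI crawler that transferred 25 GB over the log period.
print(f"{estimate_co2_grams(25_000_000_000):.1f} g CO2e")  # roughly 8950 g
```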
While analysing just bot traffic won’t show your full site emissions, it reveals the environmental cost of AI crawler activity specifically. If you’re seeing excessive bandwidth consumption from certain bots, particularly training crawlers that offer minimal return traffic, this data can justify implementing rate limiting or selective blocking using robots.txt.
We’ve seen evidence of AI crawlers making hundreds of requests per second. Beyond server performance concerns, this creates unnecessary environmental impact. The carbon metrics help quantify this, whether for internal monitoring, offsetting initiatives, or deciding which bots warrant the resource cost.
You can also adjust the project settings to analyse all user-agents (including browsers) to get a fuller picture of your website’s overall carbon footprint from both users and bots.
Aggressive Crawling Patterns
Check the Events chart on the Overview tab for unusual spikes in bot activity. Legitimate crawlers typically maintain steady, predictable patterns. Sudden surges might indicate:
- A new bot discovering your site for the first time
- Aggressive crawling that could impact server performance
- A bot ignoring crawl-delay directives
Look at Average Response Time (ms) in the User Agents tab. If specific bots correlate with slower server responses, they may be overwhelming your infrastructure. This is your signal to implement rate limiting via robots.txt crawl-delay directives or server-level controls.
The Countries tab can also reveal geographic patterns. If you’re seeing AI bot traffic from unexpected regions, it’s worth investigating further in the IPs tab to check for unusual activity, though spoofing of AI bot user-agents is currently rare compared to search engine bots.
The IPs tab lets you investigate suspicious activity further. If you see high request volumes from IPs that don’t verify, you’re likely dealing with fake bot traffic that should be blocked at server level.
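Whether it’s a genuine crawler getting aggressive or fake bot traffic, a rough sketch like the one below can help quantify it by counting requests per minute for a single user-agent and flagging unusually busy minutes. The bot name, threshold, combined log format and ‘access.log’ filename are all assumptions to adjust:

```python
import re
from collections import Counter
from datetime import datetime

LINE = re.compile(r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "[^"]*" \d{3} \S+ '
                  r'"[^"]*" "(?P<ua>[^"]*)"')

per_minute = Counter()  # "YYYY-MM-DD HH:MM" -> GPTBot request count

with open("access.log") as f:
    for raw in f:
        m = LINE.match(raw)
        if not m or "GPTBot" not in m["ua"]:
            continue
        # Combined-format timestamps look like 12/Mar/2025:10:15:32 +0000.
        ts = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        per_minute[ts.strftime("%Y-%m-%d %H:%M")] += 1

# Flag any minute with more than 300 requests (roughly 5 per second).
for minute, count in sorted(per_minute.items()):
    if count > 300:
        print(f"Spike: {count} GPTBot requests at {minute}")
```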
Crawl Depth and Coverage
Use the Directories tab to see how deeply AI bots are exploring your site structure. Are they only hitting top-level pages, or are they discovering content several levels deep?
Poor crawl depth might indicate:
- Weak internal linking making content hard to discover
- JavaScript rendering issues (most AI bots can’t execute JS like Googlebot can)
- Robots.txt rules inadvertently blocking important sections
Compare coverage across different bot types. If training bots are crawling extensively but citation bots barely visit, you’re contributing to model training without getting any benefit in return in terms of response visibility.
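On the robots.txt point above, a quick way to double-check access outside the tool is Python’s built-in robots.txt parser. This is a minimal sketch with a hypothetical domain and example URLs; swap in your own site and the bots you care about:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site; replace with your own domain and key URLs.
rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()

urls = [
    "https://www.example.com/blog/some-guide/",
    "https://www.example.com/products/widget/",
]

for agent in ["GPTBot", "ClaudeBot", "Perplexity-User", "Googlebot"]:
    for url in urls:
        verdict = "allowed" if rp.can_fetch(agent, url) else "BLOCKED"
        print(f"{agent:>16}  {verdict:>7}  {url}")
```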
4) Combining Crawl Data with Log File Analysis
For deeper insights, you can import data from a Screaming Frog SEO Spider crawl to compare what’s on your site against what AI bots are actually accessing.
Importing SEO Spider Data
Export the ‘Internal’ tab from an SEO Spider crawl and drag it directly into the ‘Imported URL Data’ tab in the Log File Analyser. The tool will automatically match URLs between your crawl data and log file events.
Once imported, use the ‘View’ filter in the URLs or Response Codes tabs to toggle between different data views:
Matched With URL Data – Shows URLs that appear in both your crawl and your log files, with crawl data displayed alongside log metrics. This reveals which discoverable pages AI bots are actually accessing.
Not In URL Data – Shows URLs in your logs that weren’t found in your crawl. These are typically orphaned pages (not linked internally), old URLs that now redirect, or URLs linked from external backlinks that no longer exist on your site.
Not In Log File – Shows URLs from your crawl that don’t appear in your logs. These pages exist on your site but haven’t been accessed by the AI bots you’re monitoring during the log file period.
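The tool handles this matching automatically, but if you’d like to replicate the logic yourself (for example, to calculate coverage across multiple exports), here’s a rough sketch using CSV exports. The filenames and column names (‘Address’ for the SEO Spider’s ‘Internal’ export and ‘URL’ for the Log File Analyser’s ‘URLs’ export) are assumptions, so check the headers in your own files:

```python
import csv

def read_urls(path: str, column: str) -> set[str]:
    """Read a single column of URLs from a CSV export."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[column] for row in csv.DictReader(f) if row.get(column)}

# Filenames and column names are assumptions; adjust to your own exports.
crawl_urls = read_urls("internal_all.csv", "Address")  # SEO Spider crawl
log_urls = read_urls("urls_export.csv", "URL")         # Log File Analyser URLs tab

matched = crawl_urls & log_urls           # crawlable AND hit by AI bots
not_in_url_data = log_urls - crawl_urls   # in the logs only (possible orphans)
not_in_log_file = crawl_urls - log_urls   # on the site but not visited by bots

print(f"Matched with URL data: {len(matched)}")
print(f"Not in URL data (possible orphans): {len(not_in_url_data)}")
print(f"Not in log file (not yet crawled): {len(not_in_log_file)}")
if crawl_urls:
    print(f"AI bot coverage: {len(matched) / len(crawl_urls):.1%}")
```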
Practical Applications for AI Bot Monitoring
Finding Orphan Pages Crawled by AI Bots
The ‘Not In URL Data’ view reveals orphaned pages that AI bots are discovering and crawling, even though they’re not linked in your site structure. This could indicate external sites linking to these pages, making them discoverable despite lacking internal links.
If valuable content is orphaned but getting AI bot traffic, consider adding internal links to make it more discoverable to users as well.
Identifying Pages AI Bots Are Ignoring
The ‘Not In Log File’ view shows pages that exist in your site structure but haven’t been accessed by AI bots. This could mean:
- These pages aren’t being discovered due to poor internal linking
- The content isn’t considered valuable or relevant by AI crawlers
- Robots.txt rules are blocking access (check if intentional)
- The pages are too new to have been crawled yet within your log file timeframe
For pages you want AI bots to find, particularly high-quality content suitable for citations or training, review why they’re being overlooked.
Layering in External Link Data
You can take this analysis further by importing backlink data from tools like Ahrefs, Majestic, or Moz. Export a CSV of your backlinks with URL and link count columns, then import it into the ‘Imported URL Data’ tab.
This lets you correlate external link popularity with AI bot crawling behaviour. You might discover:
- Pages with more external links get crawled more frequently by AI bots
- Certain types of content attract both backlinks and AI bot attention
- Orphaned pages that AI bots find are actually being linked to externally, explaining how bots discovered them
This makes sense: if AI crawlers use similar discovery methods to traditional search engines, external links would logically influence what they prioritise. Testing this hypothesis with your own data reveals whether external link building impacts your AI visibility.
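One rough way to test it is to join backlink counts against AI bot hits per URL and compute a simple correlation. The sketch below assumes two CSV exports with hypothetical filenames and column names (‘Referring Domains’ from your backlink tool and ‘Num Events’ from the Log File Analyser URLs export), and it requires Python 3.10+ for statistics.correlation:

```python
import csv
from statistics import correlation  # Python 3.10+

def column_map(path: str, key_col: str, value_col: str) -> dict[str, float]:
    """Map URL -> numeric metric from one CSV export."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[key_col]: float(row[value_col] or 0)
                for row in csv.DictReader(f)}

# Filenames and column names are assumptions; adjust to your own exports.
backlinks = column_map("backlinks.csv", "URL", "Referring Domains")
bot_hits = column_map("urls_export.csv", "URL", "Num Events")

# Only compare URLs that appear in both datasets.
shared = sorted(set(backlinks) & set(bot_hits))
if len(shared) > 1:
    r = correlation([backlinks[u] for u in shared],
                    [bot_hits[u] for u in shared])
    print(f"Pearson correlation across {len(shared)} URLs: {r:.2f}")
```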
Understanding Crawl Efficiency
While you don’t necessarily need to import crawl data into the Log File Analyser for this, calculating what percentage of your crawlable pages are actually being accessed by AI bots can reveal further insights.
If you have 10,000 pages in your SEO Spider crawl but only 1,000 appear in your log files for AI bots, you’re seeing 10% coverage.
This metric helps assess whether you need to improve discoverability or whether AI bots are appropriately prioritising your best content. Compare this ratio across different bot types to understand varying crawl strategies.
Summary
This guide should help you use the Log File Analyser to monitor how AI bots interact with your website, from importing logs and understanding the data, to identifying patterns and combining crawl data for deeper insights.
Log file analysis reveals which content AI platforms are using for training, indexing, and real-time citations: visibility that traditional analytics tools simply can’t provide. Whether you’re tracking bandwidth consumption, spotting technical issues, or validating optimisation efforts, the Log File Analyser gives you the data to make informed decisions about AI crawler access.
Please also read our Log File Analyser user guide for more information on the tool, as well as our Log File Analyser tutorials.