How To Use Vector Embeddings for Redirect Mapping
Introduction
The SEO Spider allows you to identify semantically similar pages and outliers on a site using vector embeddings.
While not purpose-built for this task, if you’re performing a site migration and need to implement redirects from old URLs to new URL equivalents, this feature can be helpful in finding the closest matching neighbour to utilise for redirect mapping.
This functionality goes beyond matching text (as seen in our duplicate content detection) by utilising LLM embeddings that better understand language. This means it can be used to find the closest match for pages, even when the content isn’t the same.
Please note: These features require a paid licence for the software. It’s important to remember that this is not a purpose-built redirect mapping feature and won’t be the right fit for all scenarios.
This tutorial walks you through how to connect to AI providers for embeddings, enable semantic similarity analysis and find the closest matching pages that can be utilised for redirect mapping.
1) Select An AI Provider for Embeddings
To perform semantic content analysis, you will need to connect to an AI provider API to generate vector embeddings of crawled page content.
Select from OpenAI, Gemini & Ollama via ‘Config > API Access > AI’.

Ensure you have set up an account and have an API key, as outlined in the guides above.
2) Add Embeddings Prompt From Library
When you have selected your AI provider, navigate to the ‘Prompt Configuration’, select ‘Add from Library’ and choose the relevant preset for embeddings.
The Gemini embeddings and API are recommended and used in our example. For Gemini, select ‘Extract Semantic Embeddings from Page’, which will be added as a prompt.

The ‘Extract Semantic Embeddings from Page’ prompt specifically uses the ‘SEMANTIC_SIMILARITY’ task type, which suits this analysis.

The prompt will be displayed, with an error message that ‘Store HTML’ must also be configured. More on this shortly.
Tip! ‘Page Text’ is based upon the content area settings. This automatically excludes the nav and footer elements of a page, but you can customise it so it provides the exact content you wish to send to the LLM.
3) Connect To the API
Before enabling the Store HTML setting, remember to ‘Connect’ to the API under ‘Account Information’.

This means when you start the crawl, embeddings will be generated and displayed in the AI tab.
4) Select ‘Store HTML’ & ‘Store Rendered HTML’
Click ‘Config > Spider > Extraction’ and enable ‘Store HTML’ and ‘Store Rendered HTML’ so page text is stored and used for vector embeddings.

The raw HTML page text will be used for crawls in text-only mode, while the rendered HTML page text will be used in JavaScript rendering mode.
5) Enable Embeddings Functionality
Navigate to the embeddings configuration via ‘Config > Content > Embeddings’ and ‘Enable Embedding functionality’.
The prompt set up should automatically be displayed in the embedding prompt dropdown.
Enable ‘Semantic Similarity’ to populate the relevant columns and filters in the Content tab.

You can also enable ‘Low Relevance’, if you’re interested in identifying pages that deviate from the average content on the site.
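As a rough illustration of what deviating from the average content could look like, the sketch below averages a few toy embedding vectors into a centroid and ranks pages by their cosine similarity to it. The URLs, the 3-dimensional vectors and the centroid approach itself are assumptions for illustration only; how the SEO Spider calculates ‘Low Relevance’ is not described in this tutorial.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
pages = {
    "https://example.com/seo-guide": [0.9, 0.1, 0.0],
    "https://example.com/seo-tips": [0.8, 0.2, 0.1],
    "https://example.com/office-party": [0.0, 0.1, 0.9],
}

centre = centroid(list(pages.values()))

# Lowest similarity to the centroid first: these are the outlier candidates.
by_relevance = sorted(pages, key=lambda url: cosine(pages[url], centre))
print(by_relevance[0])
```

In this toy data the off-topic page ends up first in the list, which is the kind of outlier a low relevance check is intended to surface.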
6) Disable Crawling Resources & External Links
If you’re not using JavaScript rendering mode, where resources are required for rendering, you can speed up the crawl by disabling the crawling of resources and external links.
These settings are located in ‘Config > Spider > Crawl’.

This avoids wasting time crawling unnecessary URLs.
7) Crawl Old & New Sites Together
Switch to list mode, by clicking ‘Mode > List’ in the top level menu.

Then click ‘Config > Spider > Limits’ and disable the ‘Limit Crawl Depth’ setting that gets applied automatically in list mode.

This means it will crawl all URLs found on the subdomains entered.
Now click ‘Upload > Enter Manually’ and input the old and new websites you wish to perform redirect mapping for.

Click ‘Next’ and wait until the crawl and API progress bar reaches 100%.
8) Run Crawl Analysis
To populate the ‘Semantically Similar’ and ‘Low Relevance Content’ filters in the Content tab (and associated columns), you need to perform crawl analysis when the crawl has completed.
In the right-hand Overview tab, an icon and message are displayed next to filters that require crawl analysis.

To run crawl analysis, just click ‘Crawl Analysis > Start’ in the top menu.

The crawl analysis progress bar will appear in the top right hand corner, and when it reaches 100%, it’s time to analyse the data.

To skip this step in future, you can run crawl analysis automatically at the end of a crawl via ‘Crawl Analysis > Configure’ and selecting ‘Auto-Analyse at End of Crawl’.
9) View ‘Closest Semantically Similar Address’
Navigate to the Content tab and view the ‘Closest Semantically Similar Address’ column to see the most similar URL for each crawled URL, which can be used for redirect mapping.

Use the search bar to filter for one of the domains and refine the results in the way that makes sense for your redirect mapping.

These results can be exported using the ‘Export’ button on the UI, and used as the base for redirect mapping.
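Once exported, the same domain filter can be applied outside the tool. The following is a minimal sketch that assumes the export is a CSV with ‘Address’, ‘Closest Semantically Similar Address’ and ‘Semantic Similarity Score’ columns. The exact header names and file layout of the real export are assumptions, and the hostnames and sample rows are hypothetical.

```python
import csv
import io
from urllib.parse import urlsplit

OLD_HOST = "old.example.com"  # hypothetical hostname of the site being migrated

# Stand-in for the exported file; in practice, open the real CSV instead.
SAMPLE_EXPORT = """Address,Closest Semantically Similar Address,Semantic Similarity Score
https://old.example.com/about,https://new.example.com/about-us,0.982
https://old.example.com/blog/post,https://old.example.com/blog/,0.941
https://new.example.com/about-us,https://old.example.com/about,0.982
"""

def rows_for_host(csv_file, host):
    """Keep only rows whose source Address is on the given host."""
    return [
        (
            row["Address"],
            row["Closest Semantically Similar Address"],
            float(row["Semantic Similarity Score"]),
        )
        for row in csv.DictReader(csv_file)
        if urlsplit(row["Address"]).hostname == host
    ]

candidates = rows_for_host(io.StringIO(SAMPLE_EXPORT), OLD_HOST)
for source, target, score in candidates:
    print(f"{source} -> {target} ({score})")
```

The output is only a list of candidate pairs; each still needs the manual validation described in the next section.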
Validate Results
The results should not be trusted blindly, and obviously require manual review and validation.
The ‘Semantic Similarity Score’ column shows how similar the ‘Closest Semantically Similar Address’ is to the URL Address. Semantic similarity scores range from 0 – 1. The higher the score, the higher the similarity to the closest semantically similar address.
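To give a feel for where a score like this can come from, here is a small sketch using cosine similarity, a common way of comparing embedding vectors. This tutorial does not state which measure the SEO Spider uses, so cosine similarity is an assumption here, as are the 2-dimensional vectors and URLs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_match(url, embeddings):
    """Return the (other_url, score) pair most similar to `url`."""
    best_url, best_score = None, float("-inf")
    for other, vector in embeddings.items():
        if other == url:
            continue  # a page is never its own closest match
        score = cosine_similarity(embeddings[url], vector)
        if score > best_score:
            best_url, best_score = other, score
    return best_url, best_score

# Hypothetical embeddings for one old URL and two new URLs.
embeddings = {
    "https://old.example.com/pricing": [0.9, 0.1],
    "https://new.example.com/plans": [0.8, 0.2],
    "https://new.example.com/contact": [0.1, 0.9],
}

match, score = closest_match("https://old.example.com/pricing", embeddings)
print(match, round(score, 3))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated content scores much lower.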
We recommend validating all results by manually reviewing them, while paying special attention to the lowest ‘Semantic Similarity Score’ by ordering the column.
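The same ordering can be reproduced on exported data to build a review queue. This is a sketch with hypothetical rows and field names; the 0.95 cut-off simply mirrors the semantic similarity threshold mentioned in this section and can be adjusted.

```python
REVIEW_THRESHOLD = 0.95  # mirrors the threshold mentioned in this section

# Hypothetical rows; in practice these would come from the exported file.
rows = [
    {"address": "https://old.example.com/about",
     "closest": "https://new.example.com/about-us", "score": 0.982},
    {"address": "https://old.example.com/blog/post",
     "closest": "https://old.example.com/blog/", "score": 0.941},
    {"address": "https://old.example.com/pricing",
     "closest": "https://new.example.com/plans", "score": 0.967},
]

# Lowest scores first, so the weakest matches are reviewed before the rest.
review_queue = sorted(rows, key=lambda row: row["score"])

# Rows under the threshold deserve the closest look.
needs_attention = [row["address"] for row in review_queue
                   if row["score"] < REVIEW_THRESHOLD]

for row in review_queue:
    flag = "CHECK" if row["score"] < REVIEW_THRESHOLD else "ok"
    print(f'{row["score"]:.3f} {flag:5} {row["address"]} -> {row["closest"]}')
```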

Any URL below the 0.95 semantic similarity threshold (set in ‘Config > Content > Embeddings’) will also have ‘0’ in the ‘No. Semantically Similar’ column.
In this case, the blog post and author are only on the live site, not on staging! So a match wasn’t found, and instead the closest matching URL was a blog category on the existing subdomain, rather than on staging.
It’s important to remember that if there is a better match on the same subdomain, a URL can be matched to its own subdomain, which obviously wouldn’t be ideal for redirect mapping. Therefore, we recommend using the search function to identify any same subdomain matches.


These can be reviewed and updated manually.
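If you are working from exported data rather than the UI, same subdomain matches can also be flagged with a few lines of code. The sketch below compares hostnames using Python’s standard `urllib.parse`; the URL pairs are hypothetical.

```python
from urllib.parse import urlsplit

def same_host(source, target):
    """True when both URLs share the same hostname (i.e. subdomain)."""
    return urlsplit(source).hostname == urlsplit(target).hostname

# Hypothetical (address, closest semantically similar address) pairs.
pairs = [
    ("https://old.example.com/about", "https://new.example.com/about-us"),
    ("https://old.example.com/blog/post", "https://old.example.com/blog/"),
    ("https://old.example.com/author/jane", "https://old.example.com/blog/"),
]

# These matches point back at the same subdomain and need a manual decision.
same_host_matches = [pair for pair in pairs if same_host(*pair)]

for source, target in same_host_matches:
    print(f"Review manually: {source} -> {target}")
```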
10) Use Duplicate Details for Alternatives
The number of URLs that are semantically similar to a URL can be viewed in the ‘No. Semantically Similar’ column. While the ‘Closest Semantically Similar Address’ only shows a single URL, the lower ‘Duplicate Details’ tab and ‘Semantic Similarity’ filter will list all URLs that are semantically similar.

These can be manually reviewed to consider if they might be better matches if the closest similar address is deemed not ideal. In the example above, it isn’t the case.
‘Semantically Similar’ URLs can be exported in bulk via the ‘Bulk Export > Content > Semantically Similar’ export.

This will include the closest semantically similar address and any others above the threshold.
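With the bulk export to hand, one option is to pick the highest scoring alternative that sits on the new site for each old URL. The sketch below works on in-memory tuples of (address, similar address, score); the column layout of the real export is not covered in this tutorial, so this structure, the hostname and the sample scores are all assumptions.

```python
from urllib.parse import urlsplit

NEW_HOST = "new.example.com"  # hypothetical hostname of the new site

# Hypothetical (address, semantically similar address, score) rows.
similar_pairs = [
    ("https://old.example.com/blog/post", "https://old.example.com/blog/", 0.97),
    ("https://old.example.com/blog/post", "https://new.example.com/blog/post", 0.96),
    ("https://old.example.com/blog/post", "https://new.example.com/blog/", 0.955),
]

def best_on_host(pairs, host):
    """For each source URL, keep the highest scoring candidate on `host`."""
    best = {}
    for source, candidate, score in pairs:
        if urlsplit(candidate).hostname != host:
            continue  # skip candidates that are not on the target site
        if source not in best or score > best[source][1]:
            best[source] = (candidate, score)
    return best

alternatives = best_on_host(similar_pairs, NEW_HOST)
for source, (candidate, score) in alternatives.items():
    print(f"{source} -> {candidate} ({score})")
```

Here the overall closest match is on the old subdomain, so the sketch falls back to the best candidate on the new one. The result is still only a suggestion to review manually.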
Alternative Options
This feature is not purpose-built for redirect mapping, so it might not always deliver perfect results depending on the scenario. That said, most redirect mapping is imperfect and this can help with some of the heavy lifting.
Alternative options include:
- Utilising Mark Williams-Cook’s AI powered Google Colab Python script, which uses the MiniLM-L6-v2 embedding model and the FAISS similarity search library for redirect mapping.
- Using fuzzy matching as outlined by Lazarina Stoy in the Ultimate End-to-End Guide to Fuzzy Matching For SEOs.
- Paid alternatives, such as Rapid301 which uses its own unique algorithm for redirect mapping.
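To give a sense of how the fuzzy matching option differs from embeddings, here is a minimal sketch that matches URL paths by string similarity using Python’s standard `difflib` module. It is not the method from any of the resources listed above, and the URLs and the 0.8 cut-off are hypothetical.

```python
import difflib
from urllib.parse import urlsplit

def fuzzy_redirects(old_urls, new_urls, cutoff=0.8):
    """Map each old URL to the new URL with the most similar path, if any."""
    new_by_path = {urlsplit(url).path: url for url in new_urls}
    mapping = {}
    for old in old_urls:
        matches = difflib.get_close_matches(
            urlsplit(old).path, list(new_by_path), n=1, cutoff=cutoff
        )
        # None signals that no path was similar enough to suggest a redirect.
        mapping[old] = new_by_path[matches[0]] if matches else None
    return mapping

# Hypothetical URL lists for the old and new sites.
old_urls = [
    "https://old.example.com/services/seo-audit/",
    "https://old.example.com/team/",
]
new_urls = [
    "https://new.example.com/services/seo-audits/",
    "https://new.example.com/contact/",
]

mapping = fuzzy_redirects(old_urls, new_urls)
for old, new in mapping.items():
    print(f"{old} -> {new}")
```

This approach only compares the characters in the URL, so it works best when paths have changed slightly, whereas embeddings compare the meaning of the page content.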
Summary
The guide above should illustrate how to use the SEO Spider to find semantically similar pages in a crawl for redirect URL mapping.
This is not a purpose-built feature, so we recommend trialling what works for you in your own scenario and utilising alternative options where appropriate.
Please also read our Screaming Frog SEO Spider FAQs and full user guide for more information on the tool. Get in touch with our support with any queries.