URL rewriting

Configuration > URL Rewriting

The URL rewriting feature allows you to rewrite URLs on the fly. For the majority of cases, the ‘remove parameters’ and common options (under ‘options’) will suffice. However, we do also offer an advanced regex replace feature which provides further control.

URL rewriting is only applied to URLs discovered in the course of crawling a website, not URLs that are entered as the start of a crawl in ‘Spider’ mode, or as part of a set of URLs in ‘List’ mode.

Remove Parameters

This feature allows you to automatically remove parameters from URLs. This is extremely useful for websites with session IDs, Google Analytics tracking or lots of parameters which you wish to remove.

For example, if the website has session IDs which make the URLs appear something like ‘example.com/?sid=random-string-of-characters’, you just need to add ‘sid’ (without the quotation marks) within the ‘parameters’ field in the ‘remove parameters’ tab to remove the session ID.

The SEO Spider will then automatically strip the session ID from the URL. You can test to see how a URL will be rewritten by our SEO Spider under the ‘test’ tab.

This feature can also be used for removing Google Analytics tracking parameters. For example, you can just include the following under ‘remove parameters’ –

utm_source
utm_medium
utm_campaign

This will strip these three tracking parameters from URLs. If the site also uses utm_term or utm_content, add them to the list in the same way.
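To make the effect of ‘remove parameters’ concrete, here is a minimal Python sketch of the same idea. It is an illustration only, not the SEO Spider's own implementation: the function name remove_parameters is hypothetical, and it assumes that parameters are matched by exact name and that all other parameters are kept in their original order.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def remove_parameters(url, names):
    """Return `url` with every query parameter whose name is in `names` dropped.

    Hypothetical helper for illustration; note that urlencode() may
    normalise the percent-encoding of the parameters that remain.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Session ID example from the text.
print(remove_parameters("https://example.com/?sid=abc123&page=2", {"sid"}))
# https://example.com/?page=2

# Google Analytics tracking parameters.
tracking = {"utm_source", "utm_medium", "utm_campaign"}
print(remove_parameters(
    "https://example.com/page?utm_source=news&utm_medium=email&utm_campaign=spring&id=7",
    tracking,
))
# https://example.com/page?id=7
```

When every parameter is removed, the sketch also drops the trailing ‘?’, so ‘https://example.com/?sid=abc123’ becomes ‘https://example.com/’.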

Regex Replace

This advanced feature runs against each URL found during a crawl or in list mode. It replaces each substring of a URL that matches the regex with the given replace string. The “Regex Replace” feature can be tested in the “Test” tab of the “URL Rewriting” configuration window.
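The ‘replaces each substring that matches’ behaviour can be sketched in Python with re.sub. This is an approximation under stated assumptions: the function name regex_replace is hypothetical, and Python's regex engine is assumed to behave like the SEO Spider's for these simple patterns. The point it illustrates is that every match in the URL is replaced, and that an unanchored pattern matches wherever it occurs in the URL.

```python
import re


def regex_replace(url, pattern, replacement):
    """Replace every non-overlapping match of `pattern` in `url` (hypothetical helper)."""
    return re.sub(pattern, replacement, url)


# All matches are replaced, not only the first one.
print(regex_replace("http://example.com/a-b-c", "-", "_"))
# http://example.com/a_b_c

# An unanchored pattern also matches inside longer strings, so a bare
# 'http' pattern rewrites URLs that are already HTTPS in this sketch.
print(regex_replace("https://example.com/", "http", "https"))
# httpss://example.com/

# Anchoring the pattern to the scheme leaves HTTPS URLs untouched.
print(regex_replace("https://example.com/", r"^http://", "https://"))
# https://example.com/
```

Testing a pattern against both a URL that should change and one that should not, as the “Test” tab allows, is a quick way to catch this kind of over-matching.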

Examples are:

1) Changing all links from HTTP to HTTPS

Regex: ^http://
Replace: https://

2) Changing all links to example.com to be example.co.uk

Regex: example\.com
Replace: example.co.uk

3) Rewriting all links containing page=number to use a fixed number, e.g.

www.example.com/page.php?page=1
www.example.com/page.php?page=2
www.example.com/page.php?page=3
www.example.com/page.php?page=4

To make all of these go to www.example.com/page.php?page=1:

Regex: page=\d+
Replace: page=1

4) Removing the www. subdomain from any URL by using an empty ‘Replace’. If you want to remove a query string parameter, please use the “Remove Parameters” feature instead, as regex is not the correct tool for that job.

Regex: www\.
Replace:

5) Stripping all parameters

Regex: \?.*
Replace:

6) Changing links for only subdomains of example.com from HTTP to HTTPS

Regex: http://(.*example\.com)
Replace: https://$1

7) Removing anything after the hash in JavaScript rendering mode

Regex: #.*
Replace:

8) Adding parameters to URLs

Regex: $
Replace: ?parameter=value

This will add ‘?parameter=value’ to the end of any URL encountered.

In situations where the site already has parameters this requires more complicated expressions for the parameter to be added correctly:

Regex: (.*?\?.*)
Replace: $1&parameter=value

Regex: (^((?!\?).)*$)
Replace: $1?parameter=value

These must be entered in the order above or this will not work when adding the new parameter to existing query strings.
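The effect of rule order for the two parameter-adding expressions can be checked with a short Python sketch. It rests on assumptions: the function name apply_rules is hypothetical, rules are assumed to be applied in sequence with each one receiving the output of the previous one, and the replace strings use Python's \1 where the SEO Spider uses $1.

```python
import re

# The two rules from the text, with `$1` written as Python's `\1`.
RULES = [
    (r"(.*?\?.*)", r"\1&parameter=value"),       # URL already has a query string
    (r"(^((?!\?).)*$)", r"\1?parameter=value"),  # URL has no query string
]


def apply_rules(url, rules):
    """Apply each (regex, replace) pair in order to the result of the previous one."""
    for pattern, replacement in rules:
        url = re.sub(pattern, replacement, url)
    return url


with_query = "http://www.example.com/page.php?page=1"
without_query = "http://www.example.com/page.php"

print(apply_rules(with_query, RULES))
# http://www.example.com/page.php?page=1&parameter=value
print(apply_rules(without_query, RULES))
# http://www.example.com/page.php?parameter=value

# With the rules reversed, a URL that gains '?parameter=value' from the
# "no query string" rule then also matches the "has a query string" rule.
print(apply_rules(without_query, list(reversed(RULES))))
# http://www.example.com/page.php?parameter=value&parameter=value
```

In this sketch the reversed order adds the parameter twice to URLs that started without a query string, which is one way the ordering requirement shows up.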

Options

Common options are included in this section. The ‘lowercase discovered URLs’ option does exactly that: it converts all crawled URLs to lowercase, which can be useful for websites with case sensitivity issues in URLs.

