Exclude


Configuration > Exclude

The exclude configuration allows you to exclude URLs from a crawl by using partial regex matching. A URL that matches an exclude is not crawled at all (it’s not just ‘hidden’ in the interface). This means that any other URLs which do not match the exclude, but can only be reached via an excluded page, will also not be found in the crawl.

The exclude list is applied to new URLs that are discovered during the crawl; it is not applied to the initial URL(s) supplied in crawl or list mode.

Changing the exclude list during a crawl will affect newly discovered URLs, and it will be applied retrospectively to the list of pending URLs, but it will not update URLs that have already been crawled.

Matching is performed against the URL encoded version of the URL. You can see the encoded version of a URL by selecting it in the main window, then opening the ‘URL Details’ tab in the lower window pane and checking the second row, labelled “URL Encoded Address”.
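Because matching is partial, a pattern only needs to match somewhere within the encoded URL rather than the whole address. As a rough illustration only (the class and method names below are made up for this sketch and are not the SEO Spider’s own code), the behaviour is equivalent to something like the following Java snippet:

    import java.util.List;
    import java.util.regex.Pattern;

    public class ExcludeCheck {

        // A URL is excluded if any exclude pattern matches anywhere in its
        // encoded address. find() gives partial-match semantics, not a full match.
        static boolean isExcluded(String encodedUrl, List<Pattern> excludes) {
            for (Pattern p : excludes) {
                if (p.matcher(encodedUrl).find()) {
                    return true;
                }
            }
            return false;
        }

        public static void main(String[] args) {
            List<Pattern> excludes = List.of(Pattern.compile("\\?price"));
            // Prints true: '?price' appears somewhere within the URL.
            System.out.println(isExcluded("http://www.example.com/shoes?price=asc", excludes));
        }
    }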

Here are some common examples –

  • To exclude a specific URL or page the syntax is:
    http://www.example.com/do-not-crawl-this-page.html
  • To exclude a sub directory or folder the syntax is:
    http://www.example.com/do-not-crawl-this-folder/
  • To exclude everything after ‘brand’ where there can sometimes be other folders before it:
    http://www.example.com/.*/brand.*
  • If you wish to exclude URLs with a certain parameter, such as ‘?price’, contained in a variety of different directories, you can simply use (note the ? is a special character in regex and must be escaped with a backslash):
    \?price
  • To exclude anything with a question mark ‘?’ (note the ? is a special character in regex and must be escaped with a backslash):
    \?
  • If you wanted to exclude all files ending jpg, the regex would be:
    jpg$
  • If you wanted to exclude all URLs with 1 or more digits in a folder such as ‘/1/’ or ‘/999/’:
    /\d+/$
  • If you wanted to exclude all URLs ending with a random 6 digit number after a hyphen such as ‘-402001’, the regex would be:
    -[0-9]{6}$
  • If you wanted to exclude any URL with ‘exclude’ within them, the regex would be:
    exclude
  • Secure (https) pages would be:
    https
  • Excluding all pages on http://www.domain.com would be:
    http://www.domain.com/
  • If you want to exclude a URL and it doesn’t seem to be working, it’s probably because it contains special regex characters such as ?. Rather than trying to locate and escape these individually, you can escape the whole line by starting it with \Q and ending it with \E, as follows:
    \Qhttp://www.example.com/test.php?product=special\E
  • Remember to use the encoded version of the URL. So if you wanted to exclude any URLs with a pipe |, it would be:
    %7C
  • If you’re extracting cookies, which removes the automatic exclusion of Google Analytics tracking tags, you could stop them from firing by including:
    google-analytics.com
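If you want to sanity-check patterns like those above before running a crawl, a rough approach is to apply them with the same partial-match semantics against the encoded URLs you expect to exclude. The harness below is only a sketch: the URLs are invented examples, and Java’s java.util.regex is used because it supports the \Q...\E escaping mentioned above, although it is an assumption that it behaves identically to the SEO Spider’s own regex engine.

    import java.util.regex.Pattern;

    public class ExcludeExamples {
        public static void main(String[] args) {
            // A few of the example patterns above (backslashes are doubled for Java string literals).
            String[] patterns = {
                "\\?price",
                "jpg$",
                "-[0-9]{6}$",
                "\\Qhttp://www.example.com/test.php?product=special\\E",
                "%7C"
            };

            // Made-up encoded URLs to check the patterns against.
            String[] urls = {
                "http://www.example.com/shoes?price=desc",
                "http://www.example.com/images/logo.jpg",
                "http://www.example.com/product-402001",
                "http://www.example.com/test.php?product=special",
                "http://www.example.com/red%7Cblue"
            };

            for (String url : urls) {
                for (String regex : patterns) {
                    if (Pattern.compile(regex).matcher(url).find()) {
                        System.out.println(url + "  excluded by  " + regex);
                    }
                }
            }
        }
    }

Any URL printed as ‘excluded by’ a pattern would not be crawled with that exclude in place.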

Check out our video guide on the exclude feature.

