Pattern Matching with robots.txt

The robots.txt file is a simple text file placed in the root directory of your site that gives robots specific instructions on how they may access it. The SEO implication is that you can restrict certain pages from being crawled, helping to ensure that only relevant pages are made available to search engines and helping to retain PageRank on those pages. The robots.txt file uses three rules: User-agent (the robot the rules apply to), Disallow (the URL that is blocked for the given robot) and Allow (the URL that the robot is allowed to access). Aside from blocking given URLs, pattern matching can be used to disallow access to any URL that, as you may have guessed, matches the pattern.
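To see the three rules working together, here is a minimal sketch using Python's standard urllib.robotparser module. The /private/ paths and the example.com host are made up for illustration. This parser matches paths as plain prefixes, so it is used here only with literal paths, not with the patterns covered below.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block /private/ for Googlebot, except the press folder.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /private/press/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether Googlebot may fetch each (made-up) URL.
for url in [
    "https://example.com/private/report.html",
    "https://example.com/private/press/launch.html",
    "https://example.com/index.html",
]:
    print(url, parser.can_fetch("Googlebot", url))
```

Only the first URL should come back as blocked: the second is covered by the Allow rule, and no rule applies to the third.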

Matching a sequence of characters – This can be done with an asterisk (*), which stands for any sequence of characters and so blocks any directories or files that match the given text. For example, the following rules disallow Googlebot from any directory beginning with “admin”:

User-agent: Googlebot
Disallow: /admin*/
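If you want to check which paths a pattern like this would catch, one rough way is to translate it into a regular expression. The sketch below assumes that "*" means "any sequence of characters" and that a rule matches from the start of the path. The function names and test paths are my own, not part of any robots.txt library.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a robots.txt path pattern containing "*" into a compiled regex."""
    # Escape each literal piece, then join the pieces with ".*" in place of "*".
    pieces = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(".*".join(pieces))

def rule_matches(pattern, path):
    # Rules match from the start of the path, so re.match (anchored at the start) fits.
    return wildcard_to_regex(pattern).match(path) is not None

# Hypothetical paths to try against the rule above.
for path in ["/admin/", "/admin-tools/login.php", "/administrator/",
             "/blog/admin/", "/admin.html"]:
    print(path, rule_matches("/admin*/", path))
```

The first three paths match the rule. "/blog/admin/" does not, because the path does not start with "/admin", and "/admin.html" does not, because nothing after "/admin" supplies the closing slash.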

Blocking a URL containing a question mark – This is a useful technique when dealing with pages that pass user-submitted data through the URL. In the following example, the rules disallow any URL that contains a “?”:

User-agent: Googlebot
Disallow: /*?
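One thing worth noting is that the "?" here is a literal question mark, not a single-character wildcard as it would be in shell globbing. Only the "*" is special. As a quick illustration, the rule above can be written by hand as a regular expression. The function name and sample paths are assumptions for the example.

```python
import re

# Hand translation of "/*?": "*" becomes ".*", and "?" is escaped so that it
# is treated as a literal question mark.
QUERY_RULE = re.compile(r"/.*\?")

def blocked_by_query_rule(path):
    """Return True if the path (including any query string) matches "/*?"."""
    return QUERY_RULE.match(path) is not None

# Hypothetical paths: two with a "?", one without.
for path in ["/search?q=shoes", "/search?", "/search"]:
    print(path, blocked_by_query_rule(path))
```

Both paths containing a "?" are blocked, including the one with nothing after it. The plain "/search" is not.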

Matching the end of a URL – This can be achieved with the “$” symbol. Placing it at the end of a pattern specifies that the rule applies only to URLs that end there. Continuing the example of user-submitted information being passed through the URL, the following rules ensure that any page with information after the “?” is disallowed, whilst pages with nothing after the “?” are allowed:

User-agent: Googlebot
Allow: /*?$
Disallow: /*?
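The two rules above overlap for a URL that ends in "?", so which one wins matters. The sketch below assumes the precedence Google documents for Googlebot: the longest matching pattern wins, and Allow is preferred on a tie. A robot that simply takes the first matching rule would reach the same result here, because Allow is listed first. The helper names and sample paths are hypothetical.

```python
import re

def compile_rule(pattern):
    """Compile a robots.txt pattern: "*" is a wildcard, a trailing "$" anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

# (directive, pattern) pairs taken from the rules above.
RULES = [("allow", "/*?$"), ("disallow", "/*?")]

def is_allowed(path, rules=RULES):
    # Collect (pattern length, is_allow) for every rule that matches the path.
    matching = [(len(p), d == "allow") for d, p in rules if compile_rule(p).match(path)]
    if not matching:
        return True  # no rule applies, so crawling is permitted
    # Longest pattern wins; on a tie, True (allow) sorts above False (disallow).
    return max(matching)[1]

# Hypothetical paths: nothing after "?", data after "?", and no "?" at all.
for path in ["/page?", "/page?id=1", "/page"]:
    print(path, is_allowed(path))
```

Only "/page?id=1" is disallowed. "/page?" matches both rules, but the longer Allow pattern takes precedence, and "/page" matches neither.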

Nick Price
SEO Programmer
