
Apache regular expression guide

Apache mod_rewrite is voodoo. Damned cool voodoo, but still voodoo.


Apache mod_rewrite allows you to harness the power of regular expressions, but for many that power is out of their control. This guide explains exactly what's going on inside mod_rewrite and how you can start using its full capabilities.

Matching with a regular expression (known as a regex) involves two parts: a pattern and an input string. In the case of Apache mod_rewrite, the pattern is the first argument after the keyword RewriteRule and the input string is most frequently the URI that your user is requesting. There are examples of both below.

The purpose of a regex is to describe a subset of all possible strings. In Apache mod_rewrite we take the input string and test it against the regex to see if the regex describes the input string. Apache can then take some action depending on whether the input string matches the regex or not. At its simplest, this consists of determining whether one string exists inside the other.

Input string: "/page2.html"
Regex       : "/page2.html"
Matches     : TRUE

Input string: "/page2.html"
Regex       : "age"
Matches     : TRUE

Input string: "/page3.html"
Regex       : "/page2.html"
Matches     : FALSE
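If you want to try these checks without touching a live server, a few lines of Python will do. This is only a sketch: Python's re module is standing in for the PCRE engine that Apache uses (an assumption made for illustration; the two agree on the simple patterns in this guide), and the helper name matches is made up for this example.

```python
import re

def matches(pattern: str, input_string: str) -> bool:
    """Return True if the pattern is found anywhere in the input string.

    re.search is used rather than re.match because a mod_rewrite pattern,
    like re.search, is not anchored unless it contains ^ or $.
    """
    return re.search(pattern, input_string) is not None

print(matches("/page2.html", "/page2.html"))
print(matches("age", "/page2.html"))
print(matches("/page2.html", "/page3.html"))
```

The three calls correspond to the three examples above, in the same order.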

Inside a regex pattern a character can be one of two sorts: a literal character or a metacharacter. In the above examples, all the characters were literal characters except one. The dot (.) is a metacharacter. Metacharacters affect the rest of the regex pattern in varying ways.

Metacharacters:
. * + { } ? ! [ ] - ^ $ | \ ( )

A dot (.) will match any single character.

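Input string: "/page2.html"
Regex       : "..........."
Matches     : TRUE

Input string: "/a.html"
Regex       : "..........."
Matches     : FALSE

The second input has only seven characters, so eleven dots cannot be satisfied. Note that an input longer than eleven characters would still match, because the pattern only has to be found somewhere inside the input string.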
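A quick way to see how a run of eleven dots behaves against inputs of different lengths is to print what it actually grabs. The sketch below uses Python's re module as a stand-in for Apache's regex engine; the helper name dot_match and the two extra input strings are invented for the example. An input longer than eleven characters still matches, because the unanchored pattern only needs to find eleven characters somewhere in the input.

```python
import re

eleven_dots = "." * 11  # the same pattern as eleven dots typed out by hand

def dot_match(input_string: str):
    """Return the text matched by eleven dots, or None if there is no match."""
    m = re.search(eleven_dots, input_string)
    return m.group(0) if m else None

print(dot_match("/page2.html"))                    # exactly 11 characters
print(dot_match("/a.html"))                        # only 7 characters
print(dot_match("/a-much-longer-page-name.html"))  # more than 11 characters
```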

A star (*) will modify the pattern to mean zero or more of the previous character. The star modifier works with both literal and meta characters preceding it. The pattern (.*) will match any input string.

Input string: "/page2.html"
Regex       : ".*"
Matches     : TRUE

Input string: ""
Regex       : ".*"
Matches     : TRUE

A plus (+) is much like the star (*) except that it matches one or more of the previous character. The plus modifier works with both literal and meta characters preceding it. The pattern (.+) will match any input string other than the empty string.

Input string: "/page2.html"
Regex       : ".+"
Matches     : TRUE

Input string: ""
Regex       : ".+"
Matches     : FALSE

The curly brackets ({}) with a number inside them are much like the star and the plus, except that they match exactly that many of the previous character. The curly brackets modifier works with both literal and meta characters preceding it. Curly brackets can also define a range: put the minimum and maximum counts inside the brackets, separated by a comma.

Input string: "abba"
Regex       : "ab{2}a"
Matches     : TRUE

Input string: "abbbba"
Regex       : "ab{2,4}a"
Matches     : TRUE

Input string: "abbbbbbbba"
Regex       : "ab{2,4}a"
Matches     : FALSE

A question mark (?) is much like the star and the plus except that it matches zero or one of the previous character. The question mark modifier works with both literal and meta characters preceding it.

Input string: "a"
Regex       : ".?"
Matches     : TRUE

Input string: ""
Regex       : "."
Matches     : FALSE

Input string: ""
Regex       : ".?"
Matches     : TRUE
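The four repetition modifiers (star, plus, curly brackets and question mark) are easiest to compare side by side. Here is a rough sketch that runs a handful of the pattern and input pairs from this guide through Python's re module, which is used here as a stand-in for Apache's engine; the variable names cases and results are just for this example.

```python
import re

# (pattern, input string) pairs taken from the repetition examples.
cases = [
    (".*", ""),
    (".+", ""),
    ("ab{2}a", "abba"),
    ("ab{2,4}a", "abbbba"),
    ("ab{2,4}a", "abbbbbbbba"),
    (".?", ""),
]

# Map each pair to True or False depending on whether the pattern is found.
results = {(p, s): re.search(p, s) is not None for p, s in cases}

for (p, s), ok in results.items():
    print(f"{p!r:12} against {s!r:14} -> {ok}")
```

The only two pairs that fail are ".+" against the empty string (the plus needs at least one character) and "ab{2,4}a" against the input with eight b's (too many b's before the final a).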

The exclamation mark (!) as the first character negates the entire regular expression. This is a feature of mod_rewrite rather than of regexes themselves. Anything that would normally match the regular expression won't match it, and anything that would not normally match it now will.

Input string: "a"
Regex       : "a"
Matches     : TRUE

Input string: "a"
Regex       : "!a"
Matches     : FALSE

Input string: "a"
Regex       : "!b"
Matches     : TRUE
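Because the leading exclamation mark is handled by mod_rewrite and not by the regex engine, it has to be imitated by hand anywhere else. A minimal sketch of the idea in Python follows; the function name rewrite_pattern_matches is hypothetical, and the real Apache implementation is certainly more involved.

```python
import re

def rewrite_pattern_matches(pattern: str, input_string: str) -> bool:
    """Mimic how a leading "!" on a mod_rewrite pattern is treated.

    The "!" is stripped before the regex engine sees the pattern,
    and the result of the match is then inverted.
    """
    negate = pattern.startswith("!")
    if negate:
        pattern = pattern[1:]
    found = re.search(pattern, input_string) is not None
    return found != negate  # flips the result only when negate is True

print(rewrite_pattern_matches("a", "a"))
print(rewrite_pattern_matches("!a", "a"))
print(rewrite_pattern_matches("!b", "a"))
```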

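The square brackets ([]) will match any one of the characters inside them. Most metacharacters lose their special meaning inside square brackets: a dot (.) inside square brackets matches only a literal dot, not any character. The main exceptions are the dash (-) and the caret (^), both described below, and the backslash (\).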

Input string: "a"
Regex       : "[ab]"
Matches     : TRUE

Input string: "b"
Regex       : "[ab]"
Matches     : TRUE

Input string: "c"
Regex       : "[ab]"
Matches     : FALSE

Input string: "c"
Regex       : "[ab.]"
Matches     : FALSE

Input string: "."
Regex       : "[ab.]"
Matches     : TRUE

The star and plus modifiers act on the entire contents of the square brackets.

Input string: "abababbbaab" Regex : "[ab]+" Matches : TRUE
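To see exactly which characters a set of square brackets picks out, it helps to list them. The sketch below does that with Python's re.findall, again assuming Python's regex engine as a stand-in for Apache's; the input string "abc." is made up for the example. Inside the brackets the dot is a literal dot, so it does not pick up the "c".

```python
import re

# Which characters of the input does each set of brackets pick out?
picked_by_ab = re.findall("[ab]", "abc.")       # 'a' and 'b' only
picked_by_ab_dot = re.findall("[ab.]", "abc.")  # 'a', 'b' and the literal dot

# A modifier after the closing bracket repeats the whole set.
run = re.search("[ab]+", "abababbbaab").group(0)

print(picked_by_ab)
print(picked_by_ab_dot)
print(run)
```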

The dash (-) character inside square brackets ([]) describes ranges if it is in between the two literal characters at the start and end of the range. A dash as the very first or very last character in the square brackets will be interpreted as a literal dash and not as a range.

Input string: "lotsoflowercasealphabeticcharacters"
Regex       : "[a-z]+"
Matches     : TRUE

Input string: "UPPERCASE AND SPACES"
Regex       : "[a-z]+"
Matches     : FALSE

Input string: "UPPERCASElowercase123456789"
Regex       : "[0-9a-zA-Z]+"
Matches     : TRUE

A caret or circumflex (^) when used inside square brackets as the first character negates the set: the brackets then match any character that is not listed inside them.

Input string: "UPPERCASE AND SPACES"
Regex       : "[^a-z]+"
Matches     : TRUE

Input string: "UPPERCASElowercase123456789"
Regex       : "[^0-9a-zA-Z]+"
Matches     : FALSE
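Ranges, the literal dash and the negating caret can all be checked in a few lines. This is a sketch using Python's re module in place of Apache's engine; the input "some-title" and the variable names are chosen purely for illustration.

```python
import re

# A dash between two characters is a range; a trailing dash is a literal dash.
slug = re.search("[a-z-]+", "some-title").group(0)

# [^a-z] means "any character that is not a lowercase letter".
not_lower = re.search("[^a-z]+", "UPPERCASE AND SPACES").group(0)

# Every character in this input is a digit or a letter, so nothing matches.
no_match = re.search("[^0-9a-zA-Z]+", "UPPERCASElowercase123456789")

print(slug)
print(not_lower)
print(no_match)
```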

A caret or circumflex (^) when used as the first character of a pattern matches the start of the input string.

Input string: "page2.html"
Regex       : "^page"
Matches     : TRUE

Input string: "page2.html"
Regex       : "^html"
Matches     : FALSE

The dollar sign ($) when used as the last character of a pattern matches the end of the input string.

Input string: "page2.html"
Regex       : "page$"
Matches     : FALSE

Input string: "page2.html"
Regex       : "html$"
Matches     : TRUE
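One way to convince yourself that the caret and the dollar sign really pin a match to the ends of the input is to look at the positions where the match was found. A small sketch, assuming Python's re module behaves like Apache's engine for these patterns:

```python
import re

uri = "page2.html"

# ^ pins the match to position 0, $ pins it to the end of the string.
start_span = re.search("^page", uri).span()
end_span = re.search("html$", uri).span()

# The same words fail when they sit at the wrong end of the input.
html_at_start = re.search("^html", uri)
page_at_end = re.search("page$", uri)

print(start_span, end_span)
print(html_at_start, page_at_end)
```

Each span is a (start, end) pair of positions in the input string, with the end position not included in the match.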

The pipe character (|) means match the value on the left or the value on the right. It can be used on individual characters or entire strings. It works with both literal and meta characters.

Input string: "page2.html"
Regex       : "page(2|3).html"
Matches     : TRUE

Input string: "page3.html"
Regex       : "page(2|3).html"
Matches     : TRUE

Input string: "page4.html"
Regex       : "page(2|3).html"
Matches     : FALSE

Input string: "lowercase"
Regex       : "^([a-z]|[A-Z])+$"
Matches     : TRUE

Input string: "UPPERCASE"
Regex       : "^([a-z]|[A-Z])+$"
Matches     : TRUE

Input string: "lowercaseANDUPPERCASE"
Regex       : "^([a-z]+|[A-Z]+)$"
Matches     : FALSE
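The last example uses a slightly different regex from the two before it, and the position of the plus makes all the difference. The sketch below compares the two forms on the same inputs, using Python's re module as a stand-in for Apache's engine; the names per_character, whole_string and accepted are invented for this example.

```python
import re

per_character = "^([a-z]|[A-Z])+$"  # each character may be lower OR upper case
whole_string = "^([a-z]+|[A-Z]+)$"  # the whole string is all lower OR all upper

def accepted(pattern, inputs):
    """Return the inputs that the pattern matches."""
    return [s for s in inputs if re.search(pattern, s)]

inputs = ["lowercase", "UPPERCASE", "lowercaseANDUPPERCASE"]

print(accepted(per_character, inputs))
print(accepted(whole_string, inputs))
```

With the plus outside the parentheses the choice between the two sides of the pipe is made again for every character, so mixed case is accepted. With the plus inside, the choice is made once for the whole string.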

The backslash (\) enables you to turn any metacharacter into a literal character by placing the backslash in front of it. This is known as escaping. If you want to match the backslash itself, you can precede it with another backslash. A backslash-character sequence is treated as a single character by any modifiers following it. Be careful with a backslash in front of a letter or digit: many of those sequences have a special meaning of their own (\d, for example, matches any digit), so only escape metacharacters.

Input string: "..........."
Regex       : "\.+"
Matches     : TRUE

Input string: "Any other string"
Regex       : "\.+"
Matches     : FALSE

Input string: "\"
Regex       : "\\"
Matches     : TRUE
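Forgetting to escape the dot in a file name is an easy mistake, because the unescaped pattern still matches the name you had in mind. A short sketch of the difference follows. Note that re.escape is a Python helper, not something Apache provides, and the input "page2xhtml" is made up to show the problem.

```python
import re

literal = "page2.html"

unescaped = literal           # the dot still means "any character"
escaped = re.escape(literal)  # Python helper that backslash-escapes metacharacters

lookalike = "page2xhtml"
unescaped_hits = re.search(unescaped, lookalike) is not None
escaped_hits = re.search(escaped, lookalike) is not None

print(escaped)
print(unescaped_hits, escaped_hits)
```

In a RewriteRule you would simply type the backslash yourself, as in page2\.html.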

The parentheses () allow you to group parts of the pattern together. You can see an example of this in the section on the pipe character above. As a bonus, they also allow you to reference the parts of the input that were matched inside the parentheses later.

Input string: "/blog/category/apache"
RewriteRule ^/blog/category/([a-zA-Z]+) /blog/index.php?category=$1
Result      : /blog/index.php?category=apache
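The way $1 gets filled in can be imitated in a few lines of Python. This is a very rough sketch of a single rule, not a description of how Apache implements it: the function name apply_rewrite is hypothetical, flags and conditions are ignored, and Python's re module stands in for Apache's engine.

```python
import re

def apply_rewrite(pattern: str, substitution: str, uri: str) -> str:
    """Very rough imitation of a single RewriteRule.

    If the pattern matches, the whole URI is replaced by the substitution,
    with $1..$9 filled in from the parenthesised groups. If the pattern
    does not match, the URI is returned unchanged.
    """
    m = re.search(pattern, uri)
    if m is None:
        return uri
    # Swap each $N in the substitution for the text captured by group N.
    return re.sub(r"\$(\d)", lambda ref: m.group(int(ref.group(1))) or "", substitution)

rule = ("^/blog/category/([a-zA-Z]+)", "/blog/index.php?category=$1")

print(apply_rewrite(*rule, "/blog/category/apache"))
print(apply_rewrite(*rule, "/about"))
```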

Tricky bits

Some common pitfalls that people run into when dealing with regexes.

Input string: "ccccccccc"
Regex       : "[ab]*"
Matches     : TRUE

Why? Because the star matches ZERO or more, and there are zero 'a's and 'b's at the start of the input string. The regex happily matches the empty string.
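You can see the empty match for yourself by asking what text was matched and where. A tiny sketch, assuming Python's re module in place of Apache's engine:

```python
import re

m = re.search("[ab]*", "ccccccccc")

matched_text = m.group(0)  # the text that was matched
matched_span = m.span()    # (start, end) positions of the match

print(repr(matched_text), matched_span)
```

The match succeeds, but it is zero characters long and sits right at the start of the input.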

Input string: "ccccccccc"
Regex       : "[ab.]+"
Matches     : FALSE

Why? Because inside square brackets the dot is a literal dot, not "any character". There are no 'a's, 'b's or dots in the input string, and the plus needs at least one of them.

Input string: "/blog/2007/10/06/some-title"
RewriteRule /(.*)/(.*)/(.*) /index.php?var1=$1&var2=$2&var3=$3

Why is this a problem? Because (.*) matches everything, including the slashes! The first (.*) grabs as much as it can while still leaving two slashes for the rest of the pattern, so var1 will actually equal "blog/2007/10", var2 will equal "06" and var3 will equal "some-title". A safer way to match variables delimited by slashes is to use ([^/]+) instead of (.*). From our rules earlier, this means "match one or more of anything that is not a slash".
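Printing the captured groups makes the difference between the two styles obvious. The sketch below uses Python's re module as a stand-in for Apache's engine; the variable names greedy and safe are just for this example.

```python
import re

uri = "/blog/2007/10/06/some-title"

# (.*) is greedy and is allowed to swallow slashes.
greedy = re.search("/(.*)/(.*)/(.*)", uri).groups()

# ([^/]+) stops at the next slash, so each group holds one path segment.
safe = re.search("/([^/]+)/([^/]+)/([^/]+)", uri).groups()

print(greedy)
print(safe)
```

With ([^/]+) the three groups hold the first three path segments, "blog", "2007" and "10". The rest of the path is simply left unmatched because the pattern is not anchored at the end.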

Input string: "cccccccabbaccccccccc"
Regex       : "[ab]+"
Matches     : TRUE

Why? Because the pattern is not anchored with a caret or a dollar, and hence it does match the "abba" in the middle of all the 'c's. Changing the regex to "^[ab]+", "[ab]+$" or "^[ab]+$" would mean that it would not match this input string anymore.
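Here is a short sketch that tries the unanchored pattern and its three anchored variants against the same input, assuming Python's re module in place of Apache's engine. The dictionary name outcome is made up for the example.

```python
import re

input_string = "cccccccabbaccccccccc"
patterns = ["[ab]+", "^[ab]+", "[ab]+$", "^[ab]+$"]

# Map each pattern to True or False depending on whether it matches.
outcome = {p: re.search(p, input_string) is not None for p in patterns}

for pattern, ok in outcome.items():
    print(f"{pattern!r:10} -> {ok}")
```

Only the unanchored pattern matches. Each of the anchored variants demands an 'a' or a 'b' at one end of the input or the other, and both ends are 'c'.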

The quote at the top is from the official Apache documentation for mod_rewrite. A rather convenient mod_rewrite cheatsheet used to be found at I Love Jack Daniels' mod_rewrite cheatsheet. It has now moved to the Added Bytes mod_rewrite cheatsheet.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

