Regular Expressions
Introduction
A regular expression, often called a regex, is a sequence of characters that describes a pattern of text. Regular expressions are commonly used for pattern matching and searching. For example, suppose you are developing an application that requires the user to enter a phone number in a specific format. You might want to validate the input to ensure that the phone number is in the correct format before writing it to the database.
This is a good use case for regular expressions: we match the input string against a regular expression that defines the required pattern for the phone number. To illustrate, let's look at a piece of Python code that could validate a phone number:
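import re

def validate_phone_number(value):
    # Compile the pattern that describes the expected phone number format.
    regex = re.compile(r'^(?:(\d{3})[-. ])?(\d{4})[-. ](\d{3})')
    # search() returns a match object on success and None otherwise.
    if regex.search(value):
        return True
    else:
        return False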
In the above code, we define a function called validate_phone_number that takes a single parameter, value. In our scenario, value is the user's input, which we check against our regular expression:
^(?:(\d{3})[-. ])?(\d{4})[-. ](\d{3})
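This expression matches telephone numbers in the following format, where each separator may be a hyphen, a full stop, or a space, and the leading three-digit group with its separator is optional: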
000-0000-000
So, for example, the following would be valid:
- 082-4569-654
- 071-1597-852
- 061-6547-789
The following would be invalid:
- 0541234569
- 014-987-9874
- 123
To check this for yourself, you can use a very useful web service called Pythex, which lets you enter a regular expression and some text to test against and then shows the matches. Give it a try!
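The valid and invalid examples above can also be checked locally. The following is a minimal sketch, assuming Python 3 and the standard re module. The names PHONE_PATTERN, matches_phone_format, valid_samples and invalid_samples are illustrative and not part of the original code.

```python
import re

# Same pattern as in the text above, compiled once for reuse.
PHONE_PATTERN = re.compile(r'^(?:(\d{3})[-. ])?(\d{4})[-. ](\d{3})')

def matches_phone_format(value):
    """Return True if value starts with a phone number in the expected format."""
    return PHONE_PATTERN.search(value) is not None

# Sample inputs taken from the lists above.
valid_samples = ['082-4569-654', '071-1597-852', '061-6547-789']
invalid_samples = ['0541234569', '014-987-9874', '123']

for sample in valid_samples + invalid_samples:
    print(sample, matches_phone_format(sample))
```

Note that the pattern has no $ at the end, so an input with extra trailing characters, such as '082-4569-6549999', is also accepted. Appending $ to the pattern is one way to tighten the check if that matters for your application.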
Building Regular Expressions
Let's get started with some basic patterns.
Basic Patterns
These symbols may be used within the regular expression to match certain types of characters.
- a, B, 0, 1: Alphanumeric characters simply match themselves. The same is true for punctuation characters, such as space, comma, and so on.
- . (full stop): This is a wildcard which matches any single character (except the newline character).
- \w: Matches any alphanumeric character including underscore [0-9a-zA-Z_]. An easy way to remember this is to think of Word characters. In other words, any character you might find within a word.
- \s: Matches a single whitespace character (space, newline, return, tab).
- \S: Matches any non-whitespace character.
- \t: Tab
- \n: Newline
- \r: Return character
- \d: Matches a decimal digit [0-9].
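The basic patterns listed above can be tried out with a short script. This is a minimal sketch, assuming Python 3. The sample string and variable names are made up for illustration, and re.findall (which returns every non-overlapping match as a list) is used here only to show all matches at once.

```python
import re

# Hypothetical sample text containing letters, digits, spaces and a tab.
text = 'Order 66 shipped\ton day_3'

# Literal characters match themselves.
literal = re.search(r'shipped', text).group()   # 'shipped'

# . matches any single character except a newline.
wildcard = re.search(r'O.der', text).group()    # 'Order'

# \d matches one decimal digit.
digits = re.findall(r'\d', text)                # ['6', '6', '3']

# \w matches letters, digits and the underscore.
words = re.findall(r'\w+', text)

# \s matches whitespace, including the tab character.
spaces = re.findall(r'\s', text)

print(literal, wildcard, digits, words, spaces)
```

In Python 3, \w and \d also match non-ASCII letters and digits in text strings unless the re.ASCII flag is passed, so the bracketed ranges given above describe the ASCII-only behaviour.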
Meta Characters
These characters have special meaning within a regular expression.
- ^: Match the start of the string.
- $: Match the end of the string.
- \: Escape character. This removes the "specialness" of the character that follows it. For example, if you wanted to match the dollar sign ($), you would escape it with a backslash like this "\$" to indicate to the regex engine that we want to match that character, not use its special meaning of matching the end of the string.
- [ ]: The square brackets indicate a set of characters to match against. For example [abc] will match against "a", "b" or "c". It could also indicate a range. For example [a-z] will match any lower case alphabetical character from 'a' to 'z'.
- | : Match x OR y. For example A|B will match either A or B.
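Each of the meta characters above can be seen in action in a few lines. This is a minimal sketch, assuming Python 3; the sample strings and variable names are invented for illustration.

```python
import re

# ^ anchors the match at the start of the string, $ at the end.
starts_with_cat = re.search(r'^cat', 'catalog') is not None   # True
ends_with_cat = re.search(r'cat$', 'catalog') is not None     # False

# A backslash removes the special meaning of the next character.
price = re.search(r'\$\d+', 'Total: $25').group()             # '$25'

# Square brackets define a set or a range of characters.
vowel = re.search(r'[aeiou]', 'rhythm and blues').group()     # 'a'
lower_run = re.search(r'[a-z]+', 'ABCdefGHI').group()         # 'def'

# | matches either alternative.
pet = re.search(r'cat|dog', 'hot dog stand').group()          # 'dog'

print(starts_with_cat, ends_with_cat, price, vowel, lower_run, pet)
```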
Repetition
These symbols are used to indicate requirements for repetitions within the pattern.
- +: Match one or more occurrences. For example 0+ means match one or more 0's.
- * : Match zero or more occurrences.
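- ? : Match zero or one occurrence.
- {m,n} : Requires a pattern to repeat itself a minimum of m times, and a maximum of n times. Write it without a space after the comma; in Python, {m, n} with a space is treated as literal text rather than a repetition.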
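The repetition symbols above can be compared side by side. This is a minimal sketch, assuming Python 3; the sample strings and variable names are chosen for illustration only.

```python
import re

# + needs at least one occurrence.
one_or_more = re.search(r'0+', '1000').group()        # '000'

# * also accepts zero occurrences, so 'ac' still matches.
zero_or_more = re.search(r'ab*c', 'ac').group()       # 'ac'

# ? makes the preceding item optional.
optional_u = [re.search(r'colou?r', word).group()
              for word in ('color', 'colour')]

# {m,n} bounds the repetition count (no space after the comma).
bounded = re.search(r'\d{2,4}', '1234567').group()    # '1234'

print(one_or_more, zero_or_more, optional_u, bounded)
```

By default these quantifiers are greedy, which is why \d{2,4} takes four digits here rather than stopping at two.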
Examples
import re

match = re.search(r'iii', 'piiig') # Found "iii"
match = re.search(r'igs', 'piiig') # No match: search() returns None
match = re.search(r'..g', 'piiig') # Found "iig"
match = re.search(r'\d\d\d', 'p123g') # Found "123"
match = re.search(r'\w\w\w', '@@abcd!!') # Found "abc"
match = re.search(r'pi+', 'piiig') # Found "piii"
match = re.search(r'i+', 'piigiiii') # Found "ii"
match = re.search(r'\d\s*\d\s*\d', 'xx1 2 3xx') # Found "1 2 3"
match = re.search(r'\d\s*\d\s*\d', 'xx12 3xx') # Found "12 3"
match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # Found "123"
match = re.search(r'^b\w+', 'foobar') # No match: 'foobar' does not start with 'b'
match = re.search(r'b\w+', 'foobar') # Found "bar"
match = re.search(r'[\w.-]+@[\w.-]+', 'bob@thebuilder.com') # Found "bob@thebuilder.com"
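The examples above store the result in match but do not show how to read it. As a minimal sketch, assuming Python 3: re.search returns a match object on success or None on failure, so the result is usually tested before it is used. The sample sentence and the variable names found and span are illustrative.

```python
import re

# Hypothetical input with an email address embedded in other text.
match = re.search(r'[\w.-]+@[\w.-]+', 'Contact: bob@thebuilder.com today')

if match:
    found = match.group()   # the text that matched
    span = match.span()     # (start, end) positions of the match
    print(found, span)
else:
    found = None
    span = None
    print('no match')
```

Testing the result first avoids an AttributeError, which is what calling group() on None would raise.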