Regular Expressions

In computing, a regular expression, also referred to as regex or regexp, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters.

The following examples illustrate a few specifications that could be expressed in a regular expression:

The sequence of characters "car" appearing consecutively in any context, such as in "car", "cartoon", or "bicarbonate"

The sequence of characters "car" occurring in that order with other characters between them, such as in "Icelander" or "chandler"

The word "car" when it appears as an isolated word

The word "car" when preceded by the word "blue" or "red"

The word "car" when not preceded by the word "motor"

A dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits (for example, "$100" or "$245.99").
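As a rough sketch, each of the specifications above could be written as a Python pattern along these lines. The variable names are made up for illustration, and each pattern is only one of several reasonable ways to express the idea.

```python
import re

# "car" appearing consecutively in any context
consecutive = re.compile(r"car")
# "c", "a", "r" in that order, with any characters in between
in_order = re.compile(r"c.*a.*r")
# "car" as an isolated word (\b marks a word boundary)
isolated = re.compile(r"\bcar\b")
# "car" preceded by the word "blue" or "red"
after_colour = re.compile(r"\b(?:blue|red) car\b")
# "car" not preceded by the word "motor" (negative lookbehind)
not_after_motor = re.compile(r"(?<!motor )\bcar\b")
# a dollar sign, one or more digits, optionally a period and two digits
dollars = re.compile(r"\$\d+(?:\.\d{2})?")

prices = dollars.findall("$100 or $245.99")  # ['$100', '$245.99']
```

The snippet avoids print so that it runs unchanged under Python 2.7 and Python 3.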

Source: Wikipedia

Python's regular expression syntax follows the Perl lineage, and the standard library module re provides the regular expression functionality. Regular expressions form a small sub-language embedded within the larger Python language.
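A minimal sketch of the re module in use, assuming a sample line of the kind that appears in the chapter 11 exercises (the strings here are illustrative, not taken from mbox.txt):

```python
import re

line = "New Revision: 39772"

# re.search() returns a match object, or None when nothing matches
match = re.search(r"^New Revision: (\d+)", line)
revision = int(match.group(1)) if match else None  # 39772

# re.findall() returns a list with every non-overlapping match
numbers = re.findall(r"\d+", "r39772 replaced r39771")  # ['39772', '39771']
```

Raw strings (the r prefix) keep backslashes such as \d from being interpreted by Python before the pattern reaches the re module.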

Educational Resources

Python for Informatics

Chapter 11 - Regular Expressions (Slides, Printable Slides, Download Video, Streaming Video)

Dive Into Python 3

Chapter 5 - regular expressions

Python.org

Regular Expression HOWTO

Practice

Please post regex exercises and questions below. We can help each other learn and explore this robust and slightly difficult aspect of Python.

Task Discussion

Fer said:

11th exercises

http://pastebin.com/FM7eCUbw

on Nov. 12, 2013, 5:35 a.m.
Wouter Tebbens said:

The exercises of chapter 11 are here: http://pastebin.com/QM5eqJkh

on April 7, 2013, 11:03 a.m.
Rob said:

Chapter 11 Exercises

on Sept. 29, 2012, 4:29 p.m.
saravanan said:

Chapter 11

on Sept. 19, 2012, 9:28 a.m.
Mars83 said:
My exercises:

http://pastebin.com/TTzzzwh7

http://pastebin.com/ckL91WZc

@Link discussion:

I would not change the links.

I think "Python for Informatics" gives you a first impression of re.search() and re.findall().
But "Dive Into Python 3" covers many details not listed in the first source of information,
for example limiting the number of repetitions with {1,3}, and replacing text with re.sub().
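
A small sketch of the two features mentioned here, with made-up sample strings:

```python
import re

# {1,3} matches between one and three repetitions of the preceding item
short_numbers = re.findall(r"\b\d{1,3}\b", "7 42 365 2012")  # ['7', '42', '365']

# re.sub() replaces every match of the pattern with the replacement text
masked = re.sub(r"\d", "#", "Revision 39772")  # 'Revision #####'
```
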
on Sept. 11, 2012, 7:52 p.m.
jeroenrijckaert said:

I've updated the link "Dive Into Python 3" since the old link was broken. It still points to a general page, not a regular-expression-specific one. I think it is better to link to http://getpython3.com/diveintopython3/regular-expressions.html.

Should this be changed? Or just delete the link and use one source of information (Python for Informatics) to keep it simple?

What do you think?

on July 16, 2012, 6 a.m.
ionut.vlasin said:

Exercise 11.1 Write a simple program to simulate the operation of the grep command on
UNIX. Ask the user to enter a regular expression and count the number of lines that matched
the regular expression:

import re
fhand=open('fis.txt')
a=raw_input("Please enter the expression that you want to search for in the file: ")
count=0
for line in fhand:
    line=line.rstrip()
    if re.search(a, line):
        count=count+1
print "we have ",count,"lines which contain",a

Exercise 11.2 Write a program to look for lines of the form
New Revision: 39772
And extract the number from each of the lines using a regular expression and the findall()
method. Compute the average of the numbers and print out the average.
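import re
fhand=open('fis1.txt')
count=0
_sum=0
for line in fhand:
    # Capture only the digits that follow "New Revision:"
    a=re.findall('^New Revision: ([0-9]+)',line)
    # Skip lines that do not match, so that a[0] always exists
    if len(a) == 0:
        continue
    count=count+1
    _sum=_sum+float(a[0])
print "Avg=",_sum/count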

on July 11, 2012, 5:45 a.m.
pannix said:
exercise 11.1

exercise 11.2
on May 25, 2012, 4:33 a.m.
gnuisance said:

11.1
11.2

on March 11, 2012, 5:04 p.m.

RobG717 said:

Exercise 11.1

import re
mbox = open('mbox.txt', 'r')

def grep(expression, file):
    matches = 0
    for line in file:
        line = line.rstrip()
        if re.search(expression, line):
            matches += 1
    return matches

def main():
    """The main program loop"""
    expression = raw_input("Enter a regular expression: ")
    matches = grep(expression, mbox)
    print "mbox.txt had %d lines that matched %s" % (matches, expression)

if __name__ == '__main__':
    main()

Exercise 11.2

import re

def getAverage(filename):
    f = open(filename, 'r')
    matches = []
    for line in f:
        matches.extend(re.findall('^New Revision: (\d+)', line))
    fltmatches = [float(x) for x in matches] #convert values to floats
    average = sum(fltmatches) / len(fltmatches)
    return average

def main():
    """Main program loop"""
    filename = raw_input("Enter file: ")
    average = getAverage(filename)
    print average

if __name__ == '__main__':
    main()

on Feb. 28, 2012, 5:19 p.m.

Ken Doman said:

My answers for chapter 11 here.

on Jan. 17, 2012, 1:42 p.m.
dean said:

My answers:

http://goo.gl/xOalm

on Oct. 12, 2011, 12:53 p.m.

Sudaraka Wijesinghe said:

My py4int exercise 11 code

Exercise 11.1

import re

fname = 'mbox.txt'

rexp = raw_input('Enter a regular expression: ')

try:
    fh=open(fname)
except:
    print 'Unable to open', fname
    exit()

match_count = 0

for line in fh :
    if re.search(rexp, line) :
        match_count = match_count + 1

print fname, 'had', match_count, 'lines that matched', rexp

Exercise 11.2

import re

# Get the file name and open the file
fname = raw_input('Enter a file name: ')
if 1 > len(fname) :
    fname = 'mbox-short.txt'

try:
    fh=open(fname)
except:
    print 'Unable to open', fname
    exit()

rev_list = []

# Read each line and collect the revision number from matching lines
for line in fh :
    rev = re.findall('^New Revision: (\d+)', line)
    if 1 > len(rev) :
        continue

    rev_list = rev_list + [int(rev[0])]

# Convert to float first: in Python 2, int / int would truncate the average
print 'Average Revision:', float(sum(rev_list))/len(rev_list)

on July 18, 2011, 3:26 a.m.

Vladimir Támara Patiño said:

There is a small error in exercise 11.1, the last example says:

mbox.txt had 4218 lines that matched java$

But it should be:

mbox.txt had 4175 lines that matched java$

The official documentation for regular expressions in python is at:

http://docs.python.org/library/re.html

11.1

import re

sre = raw_input('Enter a regular expression: ')
fname = 'mbox.txt'
try:
    fhand = open(fname)
except:
    print 'File cannot be opened:', fname
    exit()
lm = 0
for line in fhand:
    n = re.findall(sre, line)
    if len(n) > 0:
        lm = lm + 1
fhand.close()
print fname, 'had', lm, 'lines that matched', sre

11.2

import re

fname = raw_input('Enter a file name: ')
try:
    fhand = open(fname)
except:
    print 'File cannot be opened:', fname
    exit()
total = 0  # named "total" to avoid shadowing the built-in sum()
num = 0
for line in fhand:
    n = re.findall(r'\s*New\s+Revision:\s*([0-9]+)', line)
    for i in n:
        total = total + int(i)
        num = num + 1
fhand.close()

if num > 0:
    print 'Average', float(total)/num
else:
    print 'No lines with "New Revision:" found in', fname

on June 15, 2011, 8:16 a.m.

Nathan Day said:

Are there exercises to be released, or are we as a team intended to build our own?

on June 1, 2011, 4:50 p.m.

Tyler Cipriani said:

I took it to mean do the exercises in the course book (Python for Informatics) at the end of the Regex chapter (chapter 11). The other resources listed here are likely for people using different Python versions than the 2.6/2.7 used in the course text.

My Chapter 11 exercise answers are posted at https://github.com/thcipriani/thcipriani.github.com/blob/master/python101/chapter11.py for reference

Also, since Regular Expressions are an incredibly dense and useful concept for almost everything you do with coding/computers, here is a really helpful tool for learning:

http://gskinner.com/RegExr/

Making your own exercises might be a cool idea since this is a very dense subject and certainly would help understanding. Trying to parse out some info from an Apache log file or something similar would be a good exercise.
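
As a starting point for such an exercise, here is a rough sketch. The log line is invented and only assumed to follow the Apache Common Log Format, and the group names are arbitrary; real log files may use a different layout.

```python
import re

# A made-up line in the Apache Common Log Format, for illustration only
log_line = ('127.0.0.1 - - [02/Jun/2011:13:36:00 +0000] '
            '"GET /index.html HTTP/1.1" 200 2326')

# Named groups (?P<name>...) make the extracted fields easy to read back
pattern = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)

match = pattern.match(log_line)
fields = match.groupdict() if match else {}
```

After this runs, fields['host'], fields['path'] and fields['status'] hold the client address, the requested path and the response code as strings.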

on June 2, 2011, 1:36 p.m. in reply to Nathan Day

Nathan Day said:

Would you show me where to find the mbox.txt file and mbox-short.txt file?

Thanks,

Nate

on June 3, 2011, 1:35 p.m. in reply to Tyler Cipriani

Tyler Cipriani said:

Sure -

mbox: http://www.py4inf.com/code/mbox.txt

mbox-short: http://www.py4inf.com/code/mbox-short.txt

All of the code in the book can be found on the index list at: http://www.py4inf.com/code/

on June 3, 2011, 1:43 p.m. in reply to Nathan Day

Nathan Day said:

Awesome thanks!

on June 3, 2011, 2:06 p.m. in reply to Tyler Cipriani