
Regular Expressions



In computing, a regular expression, also referred to as regex or regexp, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters.

The following examples illustrate a few specifications that could be expressed in a regular expression:

  • The sequence of characters "car" appearing consecutively in any context, such as in "car", "cartoon", or "bicarbonate"
  • The sequence of characters "car" occurring in that order with other characters between them, such as in "Icelander" or "chandler"
  • The word "car" when it appears as an isolated word
  • The word "car" when preceded by the word "blue" or "red"
  • The word "car" when not preceded by the word "motor"
  • A dollar sign immediately followed by one or more digits, and then optionally a period and exactly two more digits (for example, "$100" or "$245.99").
Source: Wikipedia
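
As a rough sketch, the specifications above might be written with Python's re module as follows. The dictionary name PATTERNS, its keys, and the helper matches() are made up for illustration, and each pattern is only one of several reasonable ways to express its bullet.

```python
import re

# Hypothetical names: one illustrative pattern per bullet in the list above.
PATTERNS = {
    "consecutive": r"car",                       # "car" anywhere
    "in_order": r"c.*a.*r",                      # c, a, r in order, gaps allowed
    "isolated_word": r"\bcar\b",                 # "car" as a whole word
    "after_blue_or_red": r"\b(?:blue|red) car\b",
    "not_after_motor": r"(?<!motor )\bcar\b",    # negative lookbehind
    "dollar_amount": r"\$[0-9]+(?:\.[0-9]{2})?",
}

def matches(name, text):
    """Return True if the named pattern is found anywhere in text."""
    return re.search(PATTERNS[name], text) is not None

print(matches("consecutive", "bicarbonate"))      # True
print(matches("in_order", "Icelander"))           # True
print(matches("isolated_word", "cartoon"))        # False
print(matches("after_blue_or_red", "a red car"))  # True
print(matches("not_after_motor", "motor car"))    # False
amounts = re.findall(PATTERNS["dollar_amount"], "paid $100 and $245.99")
print(amounts)                                    # ['$100', '$245.99']
```

The dollar pattern uses a non-capturing group, (?:...), so that re.findall() returns the whole amount rather than just the optional cents part.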

Python's regular expression syntax follows the Perl lineage, and the standard-library module re provides the regular expression functionality. Regular expressions are a small sub-language embedded within the larger Python language.
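
A minimal sketch of what using the re module looks like, written for Python 3 (the course text uses Python 2.6/2.7, where print is a statement). The sample line is borrowed from the chapter 11 exercises; the variable names are arbitrary.

```python
import re

line = "New Revision: 39772"

# The pattern is written in the regex sub-language, inside an ordinary
# Python string. A raw string (r"...") keeps backslashes intact.
match = re.search(r"[0-9]+", line)

# re.search() returns a match object, or None when nothing matches.
if match:
    number = match.group()
else:
    number = None

print(number)  # 39772
```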
 

Educational Resources

Python for Informatics

 

Dive Into Python 3

Chapter 5 - regular expressions

Python.org

Practice

Please post regex exercises and questions below. We can help each other learn and explore this robust and slightly difficult aspect of Python.

Task Discussion


  • Fer said:

    on Nov. 12, 2013, 5:35 a.m.
  • Wouter Tebbens said:

    The exercises of chapter 11 are here: http://pastebin.com/QM5eqJkh

    on April 7, 2013, 11:03 a.m.
  • Rob said:

    on Sept. 29, 2012, 4:29 p.m.
  • saravanan said:

    on Sept. 19, 2012, 9:28 a.m.
  • Mars83 said:

    My exercises:

    • http://pastebin.com/TTzzzwh7
    • http://pastebin.com/ckL91WZc

    @Link discussion:

    I would not change the links.

    I think "Python for Informatics" gives you a first impression of re.search() and re.findall().
    But "Dive Into Python 3" covers many details that are not in the first source.
    For example, limiting the number of repetitions with {1,3}, and replacing matches with re.sub().

    on Sept. 11, 2012, 7:52 p.m.
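
For anyone who has not reached those chapters yet, here is a small Python 3 sketch of the four features named in the post above. The sample text and variable names are made up for illustration.

```python
import re

sample = "rev 7, rev 42, rev 39772"  # made-up sample text

# re.search() stops at the first match and returns a match object (or None).
first = re.search(r"[0-9]+", sample)
print(first.group())          # 7

# re.findall() returns every non-overlapping match as a list of strings.
all_numbers = re.findall(r"[0-9]+", sample)
print(all_numbers)            # ['7', '42', '39772']

# {1,3} allows between one and three repetitions, so the five-digit
# number is split into a three-digit match and a two-digit match.
short_runs = re.findall(r"[0-9]{1,3}", sample)
print(short_runs)             # ['7', '42', '397', '72']

# re.sub() replaces every match with the replacement text.
masked = re.sub(r"[0-9]+", "N", sample)
print(masked)                 # rev N, rev N, rev N
```

Note that the quantifier is written without a space: in Python, {1, 3} is treated as literal text rather than as a repetition count.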
  • jeroenrijckaert said:

    I've updated the link "Dive Into Python 3" since the old link was broken. It still points to a general page, not a regular-expression-specific one. I think it would be better to link to http://getpython3.com/diveintopython3/regular-expressions.html.

    Should this be changed? Or just delete the link and use one source of information (Python for Informatics) to keep it simple?

    What do you think?

    on July 16, 2012, 6 a.m.
  • ionut.vlasin said:

    Exercise 11.1 Write a simple program to simulate the operation of the grep command on
    UNIX. Ask the user to enter a regular expression and count the number of lines that matched
    the regular expression:

    import re
    fhand=open('fis.txt')
    a=raw_input("Please enter the expression that you want to search for in the file: ")
    count=0
    for line in fhand:
        line=line.rstrip()
        # re.search() returns None when the line does not match
        if re.search(a, line):
            count=count+1
    print "we have",count,"lines which contain",a

    Exercise 11.2 Write a program to look for lines of the form
    New Revision: 39772
    And extract the number from each of the lines using a regular expression and the findall()
    method. Compute the average of the numbers and print out the average.
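    import re
    fhand=open('fis1.txt')
    count=0
    _sum=0
    for line in fhand:
        a=re.findall('^New Revision: ([0-9]+)',line)
        # findall() returns an empty list for lines that do not match,
        # so skip them instead of indexing a[0]
        if len(a) == 0:
            continue
        count=count+1
        _sum=_sum+float(a[0])
    if count > 0:
        print "Avg=",_sum/count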

    on July 11, 2012, 5:45 a.m.
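
For comparison, here is a Python 3 sketch of both exercises that works on any iterable of lines, so it can be tried without the data files. The function names and the sample lines are made up for illustration; an open file object could be passed in place of the list.

```python
import re

def count_matching_lines(pattern, lines):
    """Count the lines that contain at least one match for pattern."""
    return sum(1 for line in lines if re.search(pattern, line.rstrip()))

def average_revision(lines):
    """Average the numbers on lines of the form 'New Revision: 39772'.

    Returns None when no such line is found.
    """
    numbers = []
    for line in lines:
        # findall() returns the captured digits, or [] for other lines.
        found = re.findall(r"^New Revision: ([0-9]+)", line)
        numbers.extend(int(n) for n in found)
    if not numbers:
        return None
    return sum(numbers) / len(numbers)

# Made-up sample data standing in for a file.
sample_lines = [
    "New Revision: 39772\n",
    "From: someone@example.com\n",
    "New Revision: 39774\n",
]
print(count_matching_lines(r"^New", sample_lines))  # 2
print(average_revision(sample_lines))               # 39773.0
```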
  • pannix said:

    on May 25, 2012, 4:33 a.m.
  • gnuisance said:

    on March 11, 2012, 5:04 p.m.
  • RobG717 said:

    Exercise 11.1

    import re
    mbox = open('mbox.txt', 'r')
    
    def grep(expression, file):
        matches = 0
        for line in file:
            line = line.rstrip()
            if re.search(expression, line):
                matches += 1
        return matches
    
    def main():
        """The main program loop"""
        expression = raw_input("Enter a regular expression: ")
        matches = grep(expression, mbox)
        print "mbox.txt had %d lines that matched %s" % (matches, expression)
    
    if __name__ == '__main__':
        main()
    
    

    Exercise 11.2

    import re
    
    def getAverage(filename):
        f = open(filename, 'r')
        matches = []
        for line in f:
            matches.extend(re.findall(r'^New Revision: (\d+)', line))
        fltmatches = [float(x) for x in matches] #convert values to floats
        average = sum(fltmatches) / len(fltmatches)
        return average
    
    def main():
        """Main program loop"""
        filename = raw_input("Enter file: ")
        average = getAverage(filename)
        print average
    
    if __name__ == '__main__':
        main()
    on Feb. 28, 2012, 5:19 p.m.
  • Ken Doman said:

    My answers for chapter 11 here.

    on Jan. 17, 2012, 1:42 p.m.
  • dean said:

    My answers:

    http://goo.gl/xOalm

    on Oct. 12, 2011, 12:53 p.m.
  • Sudaraka Wijesinghe said:

    My py4int exercise 11 code

    Exercise 11.1

    import re
    
    fname = 'mbox.txt'
    
    rexp = raw_input('Enter a regular expression: ')
    
    try:
        fh=open(fname)
    except IOError:
        print 'Unable to open', fname
        exit()
    
    match_count = 0
    
    for line in fh :
        if re.search(rexp, line) :
            match_count = match_count + 1
    
    print fname, 'had', match_count, 'lines that matched', rexp
    
    

     

    Exercise 11.2

    import re
    
    # Get the file name and open the file
    fname = raw_input('Enter a file name: ')
    if 1 > len(fname) :
        fname = 'mbox-short.txt'
    
    try:
        fh=open(fname)
    except IOError:
        print 'Unable to open', fname
        exit()
    
    rev_list = []
    
    # Read each line and collect the revision number, if there is one
    for line in fh :
        rev = re.findall(r'^New Revision: (\d+)', line)
        if 1 > len(rev) :
            continue
    
        rev_list = rev_list + [int(rev[0])]
    
    # Convert to float so Python 2 does not truncate the average
    if rev_list :
        print 'Average Revision:', float(sum(rev_list))/len(rev_list)
    
    

    on July 18, 2011, 3:26 a.m.
  • Vladimir Támara Patiño said:

    There is a small error in exercise 11.1, the last example says:

    mbox.txt had 4218 lines that matched java$

    But it should be:

    mbox.txt had 4175 lines that matched java$
     

    The official documentation for regular expressions in python is at:

    http://docs.python.org/library/re.html

    11.1

    import re
    sre = raw_input('Enter a regular expression: ')
    fname = 'mbox.txt'
    try:
        fhand = open(fname)
    except:
        print 'File cannot be opened:', fname
        exit()
    lm = 0
    for line in fhand:
        n = re.findall(sre, line)
        if len(n) > 0:
            lm = lm + 1
    fhand.close()
    print fname, 'had', lm, 'lines that matched', sre
    

     

    11.2

    import re
    fname = raw_input('Enter a file name: ')
    try:
        fhand = open(fname)
    except:
        print 'File cannot be opened:', fname
        exit()
    total = 0  # named total so the built-in sum() is not shadowed
    num = 0
    for line in fhand:
        n = re.findall(r'\s*New\s+Revision:\s*([0-9]+)', line)
        for i in n:
            total = total + int(i)
            num = num + 1
    fhand.close()
    
    if num > 0:
        print 'Average', float(total)/num
    else:
        print 'No lines with "New Revision:" found in', fname
    
    
    on June 15, 2011, 8:16 a.m.
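
The java$ pattern in the post above relies on the $ anchor, which matches at the end of the string or just before a trailing newline. A small Python 3 sketch with made-up lines shows why java$ matches fewer lines than plain java:

```python
import re

# Made-up sample lines; each ends with a newline, as lines read from a file do.
sample_lines = [
    "I like java\n",
    "javascript is a different language\n",
    "java and python\n",
]

# "java" may appear anywhere in the line.
contains_java = [line for line in sample_lines if re.search(r"java", line)]

# "java$" requires "java" to be the last thing on the line.
ends_with_java = [line for line in sample_lines if re.search(r"java$", line)]

print(len(contains_java))   # 3
print(len(ends_with_java))  # 1
```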
  • Nathan Day said:

    Are there exercises to be released, or are we as a team meant to build our own?

    on June 1, 2011, 4:50 p.m.

    Tyler Cipriani said:

     

    I took it to mean do the exercises in the course book (Python for Informatics) at the end of the regex chapter (chapter 11). The other resources listed here are likely for people using Python versions other than the 2.6/2.7 used in the course text. 
     
    My Chapter 11 exercise answers are posted at https://github.com/thcipriani/thcipriani.github.com/blob/master/python101/chapter11.py for reference
     
    Also, since Regular Expressions are an incredibly dense and useful concept for almost everything you do with coding/computers here is a really helpful tool for learning:
     
    Making your own exercises might be a cool idea, since this is a very dense subject and it would certainly help understanding. Trying to parse some info out of an Apache log file or something similar would be a good exercise.
    on June 2, 2011, 1:36 p.m. in reply to Nathan Day
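
As a starting point for the log-file idea above, here is a Python 3 sketch that pulls a few fields out of one line in the Apache Common Log Format. The sample line is invented, and LOG_PATTERN and parse_log_line() are hypothetical names; real logs may use a different format, so treat the pattern as an assumption to adjust.

```python
import re

# Assumes the Common Log Format:
# host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-)$'
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.search(line.rstrip())
    return match.groupdict() if match else None

# Invented sample line for illustration.
sample = ('192.0.2.1 - - [02/Jun/2011:13:36:00 +0000] '
          '"GET /index.html HTTP/1.1" 200 1043')

fields = parse_log_line(sample)
print(fields["host"], fields["path"], fields["status"])
```

Named groups, (?P<name>...), keep the pattern readable and let groupdict() return the fields by name.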

    Nathan Day said:

    Would you show me where to find the mbox.txt file and mbox-short.txt file?

    Thanks,

    Nate

    on June 3, 2011, 1:35 p.m. in reply to Tyler Cipriani

    Tyler Cipriani said:

    Sure - 

    mbox: http://www.py4inf.com/code/mbox.txt

    mbox-short: http://www.py4inf.com/code/mbox-short.txt

    All of the code in the book can be found on the index list at: http://www.py4inf.com/code/

    on June 3, 2011, 1:43 p.m. in reply to Nathan Day

    Nathan Day said:

    Awesome thanks!

    on June 3, 2011, 2:06 p.m. in reply to Tyler Cipriani