Regular Expressions in Python

Introduction

Regular expressions, also known as regex, describe search patterns. A regular expression is a sequence of characters that specifies what we are trying to find in a piece of text. Regular expressions can be used for text-based search and replace operations.

Regular Expressions in Python

The match function

The match function checks whether a pattern matches at the beginning of a string. It also accepts an optional flags argument. Its syntax is re.match(pattern, string, flags=0).

Parameter Description:

pattern => The regular expression to be matched.
string => The string to be searched; the pattern must match at its beginning.
flags => Optional modifiers, combined with bitwise OR (|).
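
To make the difference between matching at the start and searching anywhere concrete, here is a minimal sketch. The sample sentence and the variable names are illustrative assumptions, not part of the original tutorial.

```python
import re

text = "Python makes regex easy"

# re.match only succeeds when the pattern matches at the start of the string.
at_start = re.match(r"Python", text)
not_at_start = re.match(r"regex", text)

# re.search scans the whole string, so it finds "regex" in the middle.
found_anywhere = re.search(r"regex", text)

print(at_start.group())        # Python
print(not_at_start)            # None
print(found_anywhere.group())  # regex
```

Because a failed match returns None, it is usually safest to test the result before calling group() on it.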

 

The search function

The search function searches for the first occurrence of a pattern anywhere in a string. It also accepts an optional flags argument. This is the syntax for the search function.

Syntax:

import re
re.search(pattern, string, flags=0)

Parameter Description:

pattern => The regular expression to be matched.
string => The string to be searched; the pattern may match anywhere in it.
flags => Optional modifiers, combined with bitwise OR (|). They are listed in the option flags table below.

Match Object Methods:

group(num=0) => Returns the entire match (or the specific subgroup num).
groups() => Returns all matching subgroups in a tuple (empty if there weren't any).

re.search returns a match object if it succeeds and None otherwise.

Example

#!/usr/bin/env python3
import re

line = "Tigers are smarter than Lions"

# Capture the word before "than" and the word after it.
searc = re.search(r'.* (.*) than (.*)', line, re.M | re.I)

if searc:
    print("searc.group() : ", searc.group())
    print("searc.group(1) : ", searc.group(1))
    print("searc.group(2) : ", searc.group(2))
else:
    print("Nothing found!!")

The output will be:

searc.group() :  Tigers are smarter than Lions
searc.group(1) :  smarter
searc.group(2) :  Lions

The sub function

The sub function searches for the given pattern and replaces each matching substring with the user-provided replacement string.

Syntax

import re
re.sub(pattern, repl, string, count=0, flags=0)

Example

#!/usr/bin/env python3
import re

phone = "8500-451-999 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num : ", num)

Output

Phone Num :  8500-451-999
Phone Num :  8500451999
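
Two features of sub that the example above does not exercise are group references in the replacement and the count argument. The following is a small sketch; the date string and variable names are made up for illustration.

```python
import re

date = "2023-04-15 and 2024-12-01"

# \1, \2 and \3 in the replacement refer to the captured groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)

# count=1 limits the substitution to the first match only.
first_only = re.sub(r"\d{4}", "YYYY", date, count=1)

print(swapped)     # 15/04/2023 and 01/12/2024
print(first_only)  # YYYY-04-15 and 2024-12-01
```

Leaving count at its default of 0 replaces every match.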

The findall function

The findall function returns all non-overlapping matches of a pattern in a string as a list.

Syntax

import re
re.findall(pattern, string, flags=0)

Example

import re
print(re.findall(r'\w','Greycampus'))

Output

['G', 'r', 'e', 'y', 'c', 'a', 'm', 'p', 'u', 's']
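
The shape of the list returned by findall depends on whether the pattern contains groups. A short sketch of both cases follows; the input string and names are illustrative assumptions.

```python
import re

text = "alice=30, bob=25, carol=41"

# No groups: each list item is the full matched text.
numbers = re.findall(r"\d+", text)

# Two groups: each list item is a tuple with one entry per group.
pairs = re.findall(r"(\w+)=(\d+)", text)

print(numbers)  # ['30', '25', '41']
print(pairs)    # [('alice', '30'), ('bob', '25'), ('carol', '41')]
```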

The split function

The split function splits the string at each point where the pattern matches.

Syntax

import re
re.split(pattern, string, maxsplit=0, flags=0)

Example

import re
print(re.split(r'-','Greycampus-Python'))

Output

['Greycampus', 'Python']
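
Unlike str.split, re.split can treat several different separators as one delimiter. The sketch below assumes a made-up record string with mixed separators and also shows the maxsplit argument.

```python
import re

record = "red, green;blue  yellow"

# Split on a comma, a semicolon, or any run of whitespace.
# The + after the character class treats consecutive separators as one.
colours = re.split(r"[,;\s]+", record)

# maxsplit=1 stops after the first split and leaves the rest intact.
head_and_rest = re.split(r"[,;\s]+", record, maxsplit=1)

print(colours)        # ['red', 'green', 'blue', 'yellow']
print(head_and_rest)  # ['red', 'green;blue  yellow']
```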

The start and end functions

These are methods of the match object. start returns the index where the matched substring begins, and end returns the index just past where it ends.

Example

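import re
k = re.search(r'\d+', '13vv1a1238')
print(k.end())    # index just past the last matched character
print(k.start())  # index of the first matched character

Output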

2
0
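
Since end() points just past the match, the two indices can be used directly as a slice. The sketch below reuses the string from the example above; span() and re.finditer are standard parts of the re module, while the variable names are illustrative.

```python
import re

text = "13vv1a1238"
m = re.search(r"\d+", text)

# Slicing with start() and end() recovers exactly the matched text.
matched = text[m.start():m.end()]
print(m.span())  # (0, 2)
print(matched)   # 13

# finditer yields one match object per occurrence, each with its own span.
spans = [mm.span() for mm in re.finditer(r"\d+", text)]
print(spans)     # [(0, 2), (4, 5), (6, 10)]
```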

Regular Expression Modifiers: Option Flags

Modifier => Description
re.I => Performs case-insensitive matching.
re.L => Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B). In Python 3 it can be used only with bytes patterns.
re.M => Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string).
re.S => Makes a period (dot) match any character, including a newline.
re.U => Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B. In Python 3 this is already the default for str patterns.
re.X => Permits more readable regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker.
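
A brief sketch of the most common flags in use may help here. The three-line sample text and the variable names are assumptions chosen for illustration.

```python
import re

text = "First line\nsecond LINE\nthird line"

# re.I: ignore case, so "line" also matches "LINE".
all_lines = re.findall(r"line", text, re.I)

# re.M: ^ matches at the start of every line, not only the start of the string.
first_words = re.findall(r"^\w+", text, re.M)

# re.S: the dot also matches newline, so .* can run across line breaks.
across = re.search(r"First.*third", text, re.S)

# re.X: whitespace and # comments inside the pattern are ignored.
# Flags are combined with bitwise OR.
verbose = re.compile(r"""
    (\w+)   # a word, captured
    \s+     # separating whitespace
    line    # the literal text line
""", re.X | re.I)

print(all_lines)              # ['line', 'LINE', 'line']
print(first_words)            # ['First', 'second', 'third']
print(across is not None)     # True
print(verbose.findall(text))  # ['First', 'second', 'third']
```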

Regular Expression Patterns

Except for the metacharacters + ? . * ^ $ ( ) [ ] { } | \ all characters match themselves. You can match a metacharacter literally by preceding it with a backslash.
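
As a small illustration of escaping, the sketch below compares an unescaped dot with an escaped one, and uses re.escape to escape a whole literal string. The price strings are made-up sample data.

```python
import re

# Unescaped, the dot matches any character, so "5x00" is found as well.
loose = re.findall(r"5.00", "5.00 5x00")

# Escaped with a backslash, the dot matches only a literal dot.
strict = re.findall(r"5\.00", "5.00 5x00")

# re.escape adds the backslashes for you when the literal text is not fixed.
literal = "$5.00"
escaped = re.search(re.escape(literal), "Total: $5.00")

print(loose)            # ['5.00', '5x00']
print(strict)           # ['5.00']
print(escaped.group())  # $5.00
```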

Pattern => Description
^ => Matches beginning of the string, or of each line with re.M.
$ => Matches end of the string (or just before a trailing newline), or of each line with re.M.
. => Matches any single character except newline. Using the s option (re.S) allows it to match newline as well.
[...] => Matches any single character in brackets.
[^...] => Matches any single character not in brackets.
re* => Matches 0 or more occurrences of preceding expression.
re+ => Matches 1 or more occurrences of preceding expression.
re? => Matches 0 or 1 occurrence of preceding expression.
re{n} => Matches exactly n occurrences of preceding expression.
re{n,} => Matches n or more occurrences of preceding expression.
re{n,m} => Matches at least n and at most m occurrences of preceding expression.
a|b => Matches either a or b.
(re) => Groups regular expressions and remembers matched text.
(?imx) => Turns on the i, m, or x options. In Python this applies to the whole pattern and must appear at its start.
(?-imx) => Turns off the i, m, or x options in Perl-style engines. Python's re module supports only the scoped form (?-imx:re).
(?:re) => Groups regular expressions without remembering matched text.
(?imx:re) => Turns on the i, m, or x options within the parentheses only.
(?-imx:re) => Turns off the i, m, or x options within the parentheses only.
(?#...) => Comment.
(?=re) => Lookahead: matches only if re matches at this position, without consuming any characters.
(?!re) => Negative lookahead: matches only if re does not match at this position, without consuming any characters.
(?>re) => Matches independent pattern without backtracking (atomic group; Python 3.11 and later).
\w => Matches word characters.
\W => Matches nonword characters.
\s => Matches whitespace. For ASCII text, equivalent to [ \t\n\r\f\v].
\S => Matches nonwhitespace.
\d => Matches digits. For ASCII text, equivalent to [0-9].
\D => Matches nondigits.
\A => Matches beginning of string.
\Z => Matches only at the end of the string.
\z => Matches end of string in Perl and Ruby; in Python, use \Z.
\G => Matches point where last match finished (Perl and Ruby; not supported by Python's re module).
\b => Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
\B => Matches nonword boundaries.
\n, \t, etc. => Matches newlines, carriage returns, tabs, etc.
\1...\9 => Matches nth grouped subexpression.
\10 => Matches nth grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code.
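
A few of the less obvious entries in the table can be sketched as follows: non-capturing groups, lookahead, negative lookahead, and a backreference. All input strings and variable names are illustrative assumptions.

```python
import re

# (?:...) groups without capturing, so only the digits are returned.
ids = re.findall(r"(?:id|ref)-(\d+)", "id-12 ref-7 tag-3")

# (?=...) is a lookahead: it checks what follows without consuming it.
before_px = re.findall(r"\d+(?=px)", "10px 20em 30px")

# (?!...) is a negative lookahead: the match fails if the given text follows.
not_keys = re.findall(r"\b\w+\b(?!:)", "key: value")

# \1 refers back to whatever group 1 matched, here a repeated word.
doubled = re.findall(r"\b(\w+) \1\b", "this is is a test test")

print(ids)        # ['12', '7']
print(before_px)  # ['10', '30']
print(not_keys)   # ['value']
print(doubled)    # ['is', 'test']
```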

Character classes

Example => Description
[Pp]ython => Match "Python" or "python"
rub[ye] => Match "ruby" or "rube"
[aeiou] => Match any one lowercase vowel
[0-9] => Match any digit; same as [0123456789]
[a-z] => Match any lowercase ASCII letter
[A-Z] => Match any uppercase ASCII letter
[a-zA-Z0-9] => Match any of the above
[^aeiou] => Match anything other than a lowercase vowel
[^0-9] => Match anything other than a digit
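
The character classes above can be tried out with findall and sub. This is a minimal sketch using a made-up sample sentence.

```python
import re

text = "Python and python, ruby and rube, R2-D2"

either_case = re.findall(r"[Pp]ython", text)
ruby_or_rube = re.findall(r"rub[ye]", text)
digits = re.findall(r"[0-9]", text)
capitals = re.findall(r"[A-Z]", text)

# A negated class: delete everything that is not a lowercase vowel.
vowels_only = re.sub(r"[^aeiou]", "", "regular")

print(either_case)   # ['Python', 'python']
print(ruby_or_rube)  # ['ruby', 'rube']
print(digits)        # ['2', '2']
print(capitals)      # ['P', 'R', 'D']
print(vowels_only)   # eua
```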

Special Character classes

Example => Description (bracket equivalents are for ASCII text)
. => Match any character except newline
\d => Match a digit: [0-9]
\D => Match a nondigit: [^0-9]
\s => Match a whitespace character: [ \t\r\n\f\v]
\S => Match nonwhitespace: [^ \t\r\n\f\v]
\w => Match a single word character: [A-Za-z0-9_]
\W => Match a nonword character: [^A-Za-z0-9_]
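
The following sketch applies the special classes to a short made-up string. It also shows that in Python 3 these classes match Unicode characters for str patterns unless re.ASCII is given; the accented sample words are an assumption added to demonstrate that.

```python
import re

text = "Order 66:\tship_2 items\n"

digits = re.findall(r"\d", text)
words = re.findall(r"\w+", text)
spaces = re.findall(r"\s", text)
non_words = re.findall(r"\W", text)

print(digits)     # ['6', '6', '2']
print(words)      # ['Order', '66', 'ship_2', 'items']
print(spaces)     # [' ', '\t', ' ', '\n']
print(non_words)  # [' ', ':', '\t', ' ', '\n']

# With str patterns, \w also matches accented letters by default.
accented = "na\u00efve caf\u00e9"
unicode_words = re.findall(r"\w+", accented)
# re.ASCII restricts \w to [A-Za-z0-9_], so the accented letters split the words.
ascii_words = re.findall(r"\w+", accented, re.ASCII)

print(ascii_words)  # ['na', 've', 'caf']
```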

Repetition cases

Example => Description
python? => Match "pytho" or "python": the n is optional
ruby* => Match "rub" plus 0 or more ys
python+ => Match "pytho" plus 1 or more ns
\d{5} => Match exactly 5 digits
\d{3,} => Match 3 or more digits
\d{6,8} => Match 6, 7, or 8 digits
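
The repetition cases can be checked with findall. In this sketch the input strings are invented, and \b word boundaries are added around the counted digit patterns so that a longer run of digits is not partially matched.

```python
import re

optional_n = re.findall(r"python?", "pytho python")
zero_or_more = re.findall(r"ruby*", "rub ruby rubyyy")
one_or_more = re.findall(r"python+", "pytho pythonnn")

# \b keeps \d{5} from matching the first five digits of a longer number.
exactly_five = re.findall(r"\b\d{5}\b", "90210 1234 123456")
three_plus = re.findall(r"\d{3,}", "12 345 67890")
six_to_eight = re.findall(r"\b\d{6,8}\b", "12345 123456 12345678 123456789")

print(optional_n)    # ['pytho', 'python']
print(zero_or_more)  # ['rub', 'ruby', 'rubyyy']
print(one_or_more)   # ['pythonnn']
print(exactly_five)  # ['90210']
print(three_plus)    # ['345', '67890']
print(six_to_eight)  # ['123456', '12345678']
```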
