Regular Expressions in Python
Introduction
Regular expressions, also known as regex, are sequences of characters that describe a search pattern. They can be used for any text-based search or search-and-replace operation.
Regular Expressions in Python
The match Function
The match function checks whether a pattern matches at the beginning of a string. It takes the pattern, the string, and an optional flags argument: re.match(pattern, string, flags=0).
Parameter Description:
Pattern => This is a regular expression that has to be matched.
String => This is the string that has to be searched to match the pattern
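The difference between match and search is easiest to see side by side. The following is a minimal sketch; the sample string and variable names are chosen for illustration only.

```python
import re

text = "Tigers are smarter than Lions"

# re.match only succeeds if the pattern matches at position 0.
at_start = re.match(r"Tigers", text)
not_at_start = re.match(r"Lions", text)

# re.search scans the whole string, so it finds "Lions" further in.
found_anywhere = re.search(r"Lions", text)

print(at_start.group())        # Tigers
print(not_at_start)            # None
print(found_anywhere.group())  # Lions
```

Because a failed match returns None, it is safer to test the result before calling group() on it.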
The search function
The search function looks for the first occurrence of a pattern anywhere in a string. Like match, it accepts an optional flags argument. This is the syntax for the search function.
Syntax:
import re
re.search(pattern, string, flags=0)
Parameter Description:
Parameter | Description |
---|---|
pattern | This is the regular expression to be matched. |
string | This is the string, which would be searched to match the pattern anywhere in the string. |
flags | You can specify different flags using bitwise OR (|). These are modifiers, which are listed in the table below. |
Match Object Methods | Description |
---|---|
group(num=0) | This method returns entire match (or specific subgroup num) |
groups() | This method returns all matching subgroups in a tuple (empty if there weren't any) |
The re.search function returns a match object if successful; otherwise it returns None.
Example
#!/usr/bin/env python3
import re
line = "Tigers are smarter than Lions"
# Group 1 captures the text before " are "; group 2 captures the next word.
search_obj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)
if search_obj:
    print("search_obj.group() :", search_obj.group())
    print("search_obj.group(1) :", search_obj.group(1))
    print("search_obj.group(2) :", search_obj.group(2))
else:
    print("Nothing found!!")
The output will be:
search_obj.group() : Tigers are smarter than Lions
search_obj.group(1) : Tigers
search_obj.group(2) : smarter
The sub function
The sub function searches for the given pattern and replaces every match with the replacement string supplied by the user. It returns the new string and leaves the original unchanged.
Syntax
import re
re.sub(pattern, repl, string, count=0, flags=0)
Example
#!/usr/bin/env python3
import re
phone = "8500-451-999 # This is Phone Number"
# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num :", num)
# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num :", num)
Output
Phone Num : 8500-451-999
Phone Num : 8500451999
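Two further features of sub are worth a short illustration: limiting the number of replacements with the count argument, and reusing captured groups in the replacement string. This is a sketch with an invented date string, not part of the original example.

```python
import re

date = "2024-01-15"

# count=1 limits the substitution to the first match only.
first_only = re.sub(r"-", "/", date, count=1)

# \1, \2 and \3 in the replacement refer to the captured groups.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", date)

print(first_only)  # 2024/01-15
print(reordered)   # 15/01/2024
```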
The findall function
The findall function returns all non-overlapping matches of a pattern in a string as a list.
Syntax
import re
re.findall(pattern, string, flags=0)
Example
import re
print(re.findall(r'\w','Greycampus'))
Output
['G', 'r', 'e', 'y', 'c', 'a', 'm', 'p', 'u', 's']
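The shape of the list returned by findall depends on whether the pattern contains capturing groups. A small sketch, using a made-up input string, shows both cases.

```python
import re

text = "alice=30, bob=25"

# No groups: each list item is the full matched text.
numbers = re.findall(r"\d+", text)

# Two groups: each list item is a (name, age) tuple of the captured groups.
pairs = re.findall(r"(\w+)=(\d+)", text)

print(numbers)  # ['30', '25']
print(pairs)    # [('alice', '30'), ('bob', '25')]
```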
The split function
The split function splits the string at every point where the pattern matches and returns the pieces as a list.
Syntax
import re
re.split(pattern, string, maxsplit=0, flags=0)
Example
import re
print(re.split(r'-','Greycampus-Python'))
Output
['Greycampus', 'Python']
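Unlike str.split, the separator here can be any pattern, and the optional maxsplit argument caps the number of splits. The inputs below are illustrative assumptions.

```python
import re

# Split on a comma or semicolon, together with any surrounding whitespace.
fields = re.split(r"\s*[,;]\s*", "red, green;blue ,  yellow")

# maxsplit=1 stops after the first split; the rest is returned as one piece.
head_tail = re.split(r"-", "a-b-c", maxsplit=1)

print(fields)     # ['red', 'green', 'blue', 'yellow']
print(head_tail)  # ['a', 'b-c']
```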
The start and end functions
These match object methods return the start and end indices of the substring matched by the pattern. The end index points one position past the last matched character.
Example
import re
k = re.search(r'\d+', '13vv1a1238')
print(k.end())
print(k.start())
Output
2
0
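Since start() is inclusive and end() is exclusive, the two values can be used directly as slice indices. A brief sketch on the same input string, with a different pattern chosen for illustration:

```python
import re

text = "13vv1a1238"
m = re.search(r"[a-z]+", text)

# start() and end() behave like slice bounds; span() returns both at once.
bounds = (m.start(), m.end())
matched_text = text[m.start():m.end()]

print(bounds)        # (2, 4)
print(m.span())      # (2, 4)
print(matched_text)  # vv
```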
Regular Expression Modifiers: Option Flags
Modifier | Description |
---|---|
re.I | Performs case-insensitive matching. |
re.L | Interprets words according to the current locale. This interpretation affects the alphabetic group (\w and \W), as well as word boundary behavior (\b and \B). In Python 3 it can only be used with bytes patterns. |
re.M | Makes $ match the end of a line (not just the end of the string) and makes ^ match the start of any line (not just the start of the string). |
re.S | Makes a period (dot) match any character, including a newline. |
re.U | Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B. In Python 3 this is already the default for string patterns. |
re.X | Permits more readable ("verbose") regular expression syntax. It ignores whitespace (except inside a set [] or when escaped by a backslash) and treats unescaped # as a comment marker. |
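
The effect of the most common flags can be checked with a few one-liners. This is a minimal sketch; the two-line sample text is an assumption made for the example.

```python
import re

text = "First line\nsecond LINE"

# re.I: ignore case.
case_insensitive = re.findall(r"line", text, re.I)

# re.M: ^ matches at the start of every line, not just the string.
line_starts = re.findall(r"^\w+", text, re.M)

# re.S: the dot also matches the newline, so the match can span lines.
without_s = re.search(r"First.*LINE", text)
with_s = re.search(r"First.*LINE", text, re.S)

# re.X: whitespace and # comments inside the pattern are ignored.
verbose = re.search(r"""
    (\w+)   # first word
    \s+     # whitespace between the words
    (\w+)   # second word
""", text, re.X)

print(case_insensitive)  # ['line', 'LINE']
print(line_starts)       # ['First', 'second']
print(without_s)         # None
print(with_s is not None)  # True
print(verbose.groups())  # ('First', 'line')
```

Flags can be combined with bitwise OR, for example re.I | re.M.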
Regular Expression Patterns
Except for the metacharacters + ? . * ^ $ ( ) [ ] { } | \ all characters match themselves. You can match a metacharacter literally by preceding it with a backslash.
Pattern | Description |
---|---|
^ | Matches beginning of line. |
$ | Matches end of line. |
. | Matches any single character except newline. Using the s option (re.S) allows it to match newline as well. |
[...] | Matches any single character in brackets. |
[^...] | Matches any single character not in brackets |
re* | Matches 0 or more occurrences of preceding expression. |
re+ | Matches 1 or more occurrence of preceding expression. |
re? | Matches 0 or 1 occurrence of preceding expression. |
re{n} | Matches exactly n occurrences of preceding expression. |
re{n,} | Matches n or more occurrences of preceding expression. |
re{n,m} | Matches at least n and at most m occurrences of preceding expression. |
a|b | Matches either a or b. |
(re) | Groups regular expressions and remembers matched text. |
(?imx) | Turns on the i, m, or x options. In Python this applies to the whole pattern and should be placed at the start of the expression. |
(?-imx) | Turns off the i, m, or x options in some regex engines. Python's re module only accepts this form with a group, as in (?-imx: re). |
(?: re) | Groups regular expressions without remembering matched text. |
(?imx: re) | Temporarily toggles on i, m, or x options within parentheses. |
(?-imx: re) | Temporarily toggles off i, m, or x options within parentheses. |
(?#...) | Comment. |
(?= re) | Specifies position using a pattern. Doesn't have a range. |
(?! re) | Specifies position using pattern negation. Doesn't have a range. |
(?> re) | Matches independent pattern without backtracking. |
\w | Matches word characters. |
\W | Matches nonword characters. |
\s | Matches whitespace. For ASCII text, equivalent to [ \t\n\r\f\v]. |
\S | Matches nonwhitespace. |
\d | Matches digits. Equivalent to [0-9]. |
\D | Matches nondigits. |
\A | Matches beginning of string. |
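\Z | Matches end of string. In Python it matches only at the very end, even if the string ends with a newline. |
\z | Matches end of string. Not supported by older versions of Python's re module; use \Z instead. |
\G | Matches point where last match finished. Found in Perl-style engines; not supported by Python's re module. |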
\b | Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets. |
\B | Matches nonword boundaries. |
\n, \t, etc. | Matches newlines, carriage returns, tabs, etc. |
\1...\9 | Matches nth grouped subexpression. |
\10 | Matches nth grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code. |
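
A few of the less obvious constructs from the table above are easier to follow in action. The sketch below covers non-capturing groups, lookaheads, and backreferences; all sample strings are invented for the example.

```python
import re

# (?:...) groups without capturing, so findall returns only the (\d+) group.
amounts = re.findall(r"(?:USD|EUR) (\d+)", "USD 10, EUR 25")

# (?=...) is a lookahead: digits match only when "px" follows, and "px"
# itself is not part of the match.
pixels = re.findall(r"\d+(?=px)", "10px 20em 30px")

# (?!...) is a negative lookahead: a "q" that is not followed by "u".
lone_q = re.search(r"q(?!u)", "quit Iraq")

# \1 refers back to whatever group 1 matched, here a repeated word.
repeated = re.search(r"\b(\w+) \1\b", "this is is a test")

print(amounts)           # ['10', '25']
print(pixels)            # ['10', '30']
print(lone_q.start())    # 8
print(repeated.group())  # is is
```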
Character classes
Example | Description |
---|---|
[Pp]ython | Match "Python" or "python" |
rub[ye] | Match "ruby" or "rube" |
[aeiou] | Match any one lowercase vowel |
[0-9] | Match any digit; same as [0123456789] |
[a-z] | Match any lowercase ASCII letter |
[A-Z] | Match any uppercase ASCII letter |
[a-zA-Z0-9] | Match any of the above |
[^aeiou] | Match anything other than a lowercase vowel |
[^0-9] | Match anything other than a digit |
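
The character classes in this table can be tried out with findall. The following is a short sketch; the input strings are assumptions chosen to exercise each class.

```python
import re

either_case = re.findall(r"[Pp]ython", "Python and python")
ruby_or_rube = re.findall(r"rub[ye]", "ruby rube rubs")
vowels = re.findall(r"[aeiou]", "Regex")
non_digits = re.findall(r"[^0-9]", "a1b2")

print(either_case)   # ['Python', 'python']
print(ruby_or_rube)  # ['ruby', 'rube']
print(vowels)        # ['e', 'e']
print(non_digits)    # ['a', 'b']
```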
Special Character classes
Example | Description |
---|---|
. | Match any character except newline |
\d | Match a digit: [0-9] |
\D | Match a nondigit: [^0-9] |
\s | Match a whitespace character: [ \t\r\n\f\v] |
\S | Match nonwhitespace: [^ \t\r\n\f\v] |
\w | Match a single word character: [A-Za-z0-9_] |
\W | Match a nonword character: [^A-Za-z0-9_] |
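
As a quick check of the shorthand classes, the sketch below runs each one against the same made-up string containing letters, digits, a comma, a space, a tab, and an underscore.

```python
import re

text = "Room 42,\tfloor_3"

digits = re.findall(r"\d", text)
spaces = re.findall(r"\s", text)
words = re.findall(r"\w+", text)
non_word = re.findall(r"\W", text)

print(digits)    # ['4', '2', '3']
print(spaces)    # [' ', '\t']
print(words)     # ['Room', '42', 'floor_3']
print(non_word)  # [' ', ',', '\t']
```

Note that the underscore counts as a word character, so floor_3 is returned as a single word.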
Repetition cases
Example | Description |
---|---|
python? | Match "pytho" or "python": the n is optional |
ruby* | Match "rub" plus 0 or more ys |
python+ | Match "pytho" plus 1 or more ns |
\d{5} | Match exactly 5 digits |
\d{3,} | Match 3 or more digits |
\d{6,8} | Match 6, 7, or 8 digits |
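
The quantifiers above can be verified with re.fullmatch, which succeeds only when the whole string matches the pattern. This is a minimal sketch with illustrative inputs.

```python
import re

# ? makes the final "n" optional.
optional_n = re.fullmatch(r"python?", "pytho")

# * allows zero or more trailing "y" characters.
many_y = re.fullmatch(r"ruby*", "rubyyy")

# + requires at least one "n", so "pytho" alone does not match.
needs_n = re.fullmatch(r"python+", "pytho")

# {3,} finds runs of three or more digits.
long_runs = re.findall(r"\d{3,}", "12 345 67890")

# {6,8} accepts 6, 7, or 8 digits and rejects shorter strings.
seven_digits = re.fullmatch(r"\d{6,8}", "1234567")
five_digits = re.fullmatch(r"\d{6,8}", "12345")

print(optional_n is not None)    # True
print(many_y is not None)        # True
print(needs_n)                   # None
print(long_runs)                 # ['345', '67890']
print(seven_digits is not None)  # True
print(five_digits)               # None
```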