Introduction #
Regular expressions, or regex, are a powerful tool for pattern matching and string manipulation. In Python, the re module provides support for working with regex.
1. Basics of Regex #
1.1 What is Regex? #
Regex is a sequence of characters that forms a search pattern. It can be used to check if a string contains a specified pattern, extract portions of strings, or replace them.
1.2 Common Regex Syntax #
Pattern | Description |
---|---|
. | Matches any single character except newline. |
^ | Matches the start of a string. |
$ | Matches the end of a string. |
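* | Matches 0 or more repetitions of the preceding element. |
+ | Matches 1 or more repetitions of the preceding element. |
? | Matches 0 or 1 occurrence of the preceding element. |
{n} | Matches exactly n repetitions of the preceding element. |
{n,} | Matches n or more repetitions of the preceding element. |
{n,m} | Matches between n and m repetitions (inclusive) of the preceding element. |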
[] | Denotes a set of characters to match. |
() | Groups expressions together. |
\ | Escapes a special character. |
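
To make the table concrete, here is a small sketch that pairs each kind of syntax with one string it should match. The sample patterns and strings are illustrative choices, not part of the original table; re.fullmatch() is used so that a pattern only counts as matching when it covers the whole string.

```python
import re

# Each pattern exercises one or two entries from the table above.
checks = {
    r"c.t": "cat",          # . matches any single character
    r"ab*c": "ac",          # * allows zero repetitions of "b"
    r"ab+c": "abbc",        # + requires at least one "b"
    r"colou?r": "color",    # ? makes the "u" optional
    r"\d{3}": "123",        # {n} requires exactly three digits
    r"[aeiou]{2,}": "eau",  # {n,} applied to a character set
    r"(ab){1,2}": "abab",   # {n,m} applied to a group
    r"3\.14": "3.14",       # \ escapes the dot so it is literal
}

# re.fullmatch() returns a match object only if the entire string matches.
results = {
    pattern: bool(re.fullmatch(pattern, text))
    for pattern, text in checks.items()
}

for pattern, matched in results.items():
    print(f"{pattern!r:16} -> {matched}")
```

Every line should print True. Changing a sample string (for example "ac" for the ab+c pattern) is a quick way to see where each quantifier draws the line.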
1.3 Importing the re Module #
import re
2. Using Regex in Python #
2.1 Matching Patterns #
re.match() #
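Checks for a match only at the beginning of a string. It returns a match object on success and None otherwise.
import re
pattern = r"^Hello"  # The ^ is optional here: re.match() already anchors at the start
text = "Hello, World!"
match = re.match(pattern, text)
if match:
    print("Pattern matched at the start of the string.")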
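
The anchoring behaviour is easiest to see by trying a pattern that occurs later in the string. The following is a minimal sketch using the same sample text; the variable names are illustrative.

```python
import re

text = "Hello, World!"

# re.match() only tries the pattern at position 0 of the string.
at_start = re.match(r"Hello", text)
not_at_start = re.match(r"World", text)

print(at_start.group())  # Hello
print(at_start.span())   # (0, 5)
print(not_at_start)      # None, because "World" is not at the start
```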
re.search() #
Scans the string and returns a match object for the first location where the pattern matches, or None if there is no match.
pattern = r"World"
text = "Hello, World!"
match = re.search(pattern, text)
if match:
    print("Pattern found in the string.")
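
Because re.search() can return None, it is common to guard the result before using it. Here is one possible sketch; find_span is a hypothetical helper introduced only for this illustration.

```python
import re

def find_span(pattern, text):
    """Return (start, end) of the first match, or None if there is none."""
    match = re.search(pattern, text)
    return match.span() if match else None

print(find_span(r"World", "Hello, World!"))   # (7, 12)
print(find_span(r"Python", "Hello, World!"))  # None
```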
2.2 Finding All Matches #
re.findall() #
Returns all non-overlapping occurrences of a pattern as a list of strings.
pattern = r"\d+" # Matches one or more digits
text = "There are 12 cats and 8 dogs."
matches = re.findall(pattern, text)
print(matches) # Output: ['12', '8']
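
Two related behaviours are worth knowing: when the pattern contains capturing groups, re.findall() returns tuples of the groups rather than whole matches, and re.finditer() yields match objects when positions are needed. A short sketch on the same sample sentence:

```python
import re

text = "There are 12 cats and 8 dogs."

# With capturing groups, findall() returns one tuple per match.
pairs = re.findall(r"(\d+) (\w+)", text)
print(pairs)  # [('12', 'cats'), ('8', 'dogs')]

# finditer() yields match objects, which also expose positions.
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]
print(positions)  # [('12', 10), ('8', 22)]
```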
2.3 Splitting Strings #
re.split() #
Splits a string by the occurrences of a pattern.
pattern = r",\s*" # Matches a comma followed by optional whitespace
text = "apple, banana, cherry"
split_text = re.split(pattern, text)
print(split_text) # Output: ['apple', 'banana', 'cherry']
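
Compared with str.split(), a pattern can cover several delimiters at once, and the maxsplit argument caps the number of splits. The sample string below is an illustrative assumption.

```python
import re

text = "apple, banana; cherry , date"

# A character set lets one pattern handle both commas and semicolons,
# with optional whitespace on either side.
parts = re.split(r"\s*[,;]\s*", text)
print(parts)  # ['apple', 'banana', 'cherry', 'date']

# maxsplit limits the number of splits; the remainder is left intact.
first_two = re.split(r"\s*[,;]\s*", text, maxsplit=1)
print(first_two)  # ['apple', 'banana; cherry , date']
```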
2.4 Replacing Patterns #
re.sub() #
Replaces occurrences of a pattern with a specified string.
pattern = r"\s+" # Matches one or more whitespace characters
text = "Python    is   fun."
result = re.sub(pattern, " ", text)
print(result) # Output: Python is fun.
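
The replacement does not have to be a fixed string. As a sketch of the other options re.sub() accepts, the example below uses group back-references, a replacement function, and the count argument; the sample strings are assumptions made for illustration.

```python
import re

text = "2024-01-15 and 2023-12-31"

# \1, \2 and \3 in the replacement refer to the captured groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
print(swapped)  # 15/01/2024 and 31/12/2023

# A function can compute each replacement from the match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 cats, 10 dogs")
print(doubled)  # 6 cats, 20 dogs

# count limits how many replacements are made.
once = re.sub(r"\s+", " ", "a   b   c", count=1)
print(once)  # a b   c
```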
3. Advanced Topics #
3.1 Groups and Capturing #
Groups are created using parentheses () and can capture parts of a match.
pattern = r"(\d{3})-(\d{3})-(\d{4})" # Matches phone numbers
text = "Call me at 123-456-7890."
match = re.search(pattern, text)
if match:
    print(match.group(0)) # Full match: 123-456-7890
    print(match.group(1)) # First group: 123
    print(match.group(2)) # Second group: 456
    print(match.group(3)) # Third group: 7890
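
Numbered groups become hard to follow as patterns grow. Python also supports named groups with the (?P<name>...) syntax; the sketch below reuses the phone-number pattern, with group names (area, prefix, line) chosen only for illustration.

```python
import re

# Same pattern as above, with each group given a name.
pattern = r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})"
match = re.search(pattern, "Call me at 123-456-7890.")

if match:
    print(match.group("area"))  # 123
    print(match.groups())       # ('123', '456', '7890')
    print(match.groupdict())    # {'area': '123', 'prefix': '456', 'line': '7890'}
```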
3.2 Non-Capturing Groups #
Non-capturing groups use (?:...) and do not capture text for back-references.
pattern = r"(?:Hello|Hi), (\w+)"
text = "Hello, Alice"
match = re.search(pattern, text)
if match:
    print(match.group(1)) # Output: Alice
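
The difference shows up in the group numbering. Comparing the same pattern with and without ?: makes this visible; this is a minimal sketch on the sample text above.

```python
import re

text = "Hello, Alice"

capturing = re.search(r"(Hello|Hi), (\w+)", text)
non_capturing = re.search(r"(?:Hello|Hi), (\w+)", text)

# The capturing version stores the greeting as group 1 and the name as group 2.
print(capturing.groups())      # ('Hello', 'Alice')

# The non-capturing version keeps only the name, as group 1.
print(non_capturing.groups())  # ('Alice',)
```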
3.3 Lookahead and Lookbehind #
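Lookahead and lookbehind assertions match a pattern only if it is (or is not) followed or preceded by another pattern. The surrounding context is checked but not included in the match.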
Positive Lookahead #
pattern = r"\w+(?=@example\.com)"
text = "user@example.com"
match = re.search(pattern, text)
print(match.group()) # Output: user
Negative Lookbehind #
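# Exclude a preceding digit as well as "$"; otherwise the match could
# start at the 0 in $50 and the result would be ['0', '20'].
pattern = r"(?<![$\d])\d+"
text = "Price: $50, Discount: 20"
matches = re.findall(pattern, text)
print(matches) # Output: ['20']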
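
The two remaining combinations, positive lookbehind (?<=...) and negative lookahead (?!...), work the same way. The following sketch uses the price text from above plus a list of file names that is assumed purely for illustration.

```python
import re

text = "Price: $50, Discount: 20"

# Positive lookbehind: digits that directly follow a dollar sign.
prices = re.findall(r"(?<=\$)\d+", text)
print(prices)  # ['50']

# Negative lookahead: file names whose extension is not "txt".
files = "report.txt notes.md data.txt"
not_txt = re.findall(r"\b\w+\.(?!txt\b)\w+", files)
print(not_txt)  # ['notes.md']
```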
4. Flags in Regex #
Flags modify the behavior of regex patterns.
Flag | Description |
---|---|
re.IGNORECASE | Makes the pattern case-insensitive. |
re.MULTILINE | Makes ^ and $ match at the start and end of each line, not only of the whole string. |
re.DOTALL | Allows . to match newline characters. |
re.VERBOSE | Allows whitespace and comments in the pattern. |
Example with Flags: #
pattern = r"(?i)hello"  # (?i) is the inline form of re.IGNORECASE
text = "Hello, world!"
match = re.search(pattern, text)
print(bool(match)) # Output: True
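
Flags can also be passed through the flags argument and combined with the | operator. Here is a short sketch of re.MULTILINE and re.DOTALL on an assumed three-line sample string:

```python
import re

text = "first line\nSecond line\nthird line"

# re.MULTILINE: ^ matches at the start of every line.
starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(starts)  # ['first', 'Second', 'third']

# re.DOTALL: . also matches newlines, so .+ can span all three lines.
without_dotall = re.search(r"first.+third", text)
with_dotall = re.search(r"first.+third", text, flags=re.DOTALL)
print(without_dotall is None)   # True
print(with_dotall is not None)  # True

# Flags combine with the | operator.
combined = re.findall(r"^second", text, flags=re.IGNORECASE | re.MULTILINE)
print(combined)  # ['Second']
```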
5. Common Pitfalls #
- Greedy vs Non-Greedy Matching:
  - Greedy: .* matches as much text as possible.
  - Non-Greedy: .*? matches as little text as possible.

  text = "<tag>content</tag>"
  greedy = re.search(r"<.*>", text)
  non_greedy = re.search(r"<.*?>", text)
  print(greedy.group()) # Output: <tag>content</tag>
  print(non_greedy.group()) # Output: <tag>
- Escaping Special Characters: Use \ to escape characters like ., *, +, etc.
- Overuse: Avoid complex regex patterns when simpler string methods suffice.
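
The last two pitfalls can be illustrated together. When the text to find is a literal string, re.escape() adds the backslashes automatically, and a plain substring test may make the regex unnecessary. The sample strings below are assumptions for the sake of the sketch.

```python
import re

literal = "1+1=2 (approx.)"
text = "Claim: 1+1=2 (approx.) holds."

# Unescaped, the +, parentheses and dot are treated as regex syntax,
# so the pattern does not match the literal text.
unescaped = re.search(literal, text)

# re.escape() backslash-escapes the special characters.
escaped = re.search(re.escape(literal), text)

print(unescaped)        # None
print(escaped.group())  # 1+1=2 (approx.)

# For a plain substring check, no regex is needed at all.
print(literal in text)  # True
```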
6. Best Practices #
- Test Patterns: Use tools like regex101 to test your patterns.
- Keep it Simple: Use comments and re.VERBOSE for readability.
- Optimize Performance: Avoid overly complex patterns for large text processing.
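
As one way to apply the readability advice, the phone-number pattern from section 3.1 could be written with re.VERBOSE and compiled once with re.compile() for reuse. This is a sketch rather than a recommended layout; the name PHONE and the sample numbers are illustrative.

```python
import re

# re.VERBOSE ignores unescaped whitespace and allows # comments in the pattern.
PHONE = re.compile(
    r"""
    (\d{3})   # area code
    -
    (\d{3})   # prefix
    -
    (\d{4})   # line number
    """,
    re.VERBOSE,
)

# A compiled pattern object can be reused across many strings.
numbers = ["123-456-7890", "555-0100", "987-654-3210"]
valid = [n for n in numbers if PHONE.fullmatch(n)]
print(valid)  # ['123-456-7890', '987-654-3210']
```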
Conclusion #
Regex is a versatile tool for text processing and pattern matching in Python. While it can handle complex scenarios, understanding its syntax, capabilities, and limitations is essential for effective use.