Python: Regular Expressions (Regex)

Introduction #

Regular expressions, or regex, are a powerful tool for pattern matching and string manipulation. In Python, the re module provides support for working with regex.

1. Basics of Regex #

1.1 What is Regex? #

A regex is a sequence of characters that defines a search pattern. It can be used to check whether a string contains the pattern, to extract the matching portions of the string, or to replace them.

1.2 Common Regex Syntax #

Pattern	Description
`.`	Matches any single character except newline.
`^`	Matches the start of a string (or of each line with `re.MULTILINE`).
`$`	Matches the end of a string, or just before a trailing newline.
`*`	Matches 0 or more repetitions.
`+`	Matches 1 or more repetitions.
`?`	Matches 0 or 1 occurrence.
`{n}`	Matches exactly `n` repetitions.
`{n,}`	Matches `n` or more repetitions.
`{n,m}`	Matches between `n` and `m` repetitions.
`[]`	Denotes a set of characters to match.
`()`	Groups expressions together.
`\`	Escapes a special character.
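
As a rough illustration of the table above, the sketch below combines anchors, a character set, quantifiers, and an escaped dot in a single pattern. The pattern and the sample strings are made up for this example.

```python
import re

# Hypothetical pattern: 2 to 4 lowercase letters, a literal dot, then 1+ digits.
pattern = r"^[a-z]{2,4}\.\d+$"

samples = ["ab.1", "abcd.2024", "a.1", "abcde.1", "ab-1"]

# Map each sample to whether the whole string fits the pattern.
results = {s: bool(re.search(pattern, s)) for s in samples}
print(results)
```

Only the first two samples should match: the others have too few letters, too many letters, or a hyphen where the escaped dot is required.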

1.3 Importing the `re` Module #

import re

2. Using Regex in Python #

2.1 Matching Patterns #

`re.match()` #

Matches a pattern only at the beginning of a string. It returns `None` if the pattern does not match starting at index 0.

import re

pattern = r"^Hello"  # The ^ is redundant here: re.match() already anchors at the start
text = "Hello, World!"
match = re.match(pattern, text)
if match:
    print("Pattern matched at the start of the string.")

`re.search()` #

Searches for a pattern anywhere in the string.

pattern = r"World"
text = "Hello, World!"
match = re.search(pattern, text)
if match:
    print("Pattern found in the string.")
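
The difference between the two functions is easiest to see on the same input. The sketch below reuses the sample text from above and also shows `re.fullmatch()`, which requires the pattern to cover the entire string. The variable names are illustrative.

```python
import re

text = "Hello, World!"

# re.match() only tries at index 0, so "World" is not found.
at_start = re.match(r"World", text)

# re.search() scans the string and finds "World" further in.
anywhere = re.search(r"World", text)

# re.fullmatch() succeeds only if the pattern spans the whole string.
whole = re.fullmatch(r"Hello, \w+!", text)

print(at_start)           # None
print(anywhere.span())    # (7, 12)
print(whole is not None)  # True
```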

2.2 Finding All Matches #

`re.findall()` #

Finds all occurrences of a pattern.

pattern = r"\d+"  # Matches one or more digits
text = "There are 12 cats and 8 dogs."
matches = re.findall(pattern, text)
print(matches)  # Output: ['12', '8']
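
Two related behaviors are worth a short sketch: when the pattern contains capturing groups, `re.findall()` returns the groups rather than the full match, and `re.finditer()` yields match objects that also carry positions. The patterns below are illustrative.

```python
import re

text = "There are 12 cats and 8 dogs."

# With two capturing groups, findall returns a list of tuples.
pairs = re.findall(r"(\d+) (\w+)", text)
print(pairs)  # [('12', 'cats'), ('8', 'dogs')]

# finditer yields match objects, so the start index is available too.
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]
print(positions)  # [('12', 10), ('8', 22)]
```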

2.3 Splitting Strings #

`re.split()` #

Splits a string by the occurrences of a pattern.

pattern = r",\s*"  # Matches a comma followed by optional whitespace
text = "apple, banana, cherry"
split_text = re.split(pattern, text)
print(split_text)  # Output: ['apple', 'banana', 'cherry']
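
A few variations on `re.split()` that often come up, shown as a sketch with a made-up input string: splitting on several delimiters at once, limiting the number of splits with `maxsplit`, and keeping the delimiters by wrapping them in a capturing group.

```python
import re

text = "apple, banana; cherry"

# A character set lets one pattern cover both delimiters.
parts = re.split(r"[,;]\s*", text)
print(parts)  # ['apple', 'banana', 'cherry']

# maxsplit=1 stops after the first split.
first_rest = re.split(r"[,;]\s*", text, maxsplit=1)
print(first_rest)  # ['apple', 'banana; cherry']

# A capturing group around the delimiter keeps it in the result.
kept = re.split(r"([,;])\s*", text)
print(kept)  # ['apple', ',', 'banana', ';', 'cherry']
```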

2.4 Replacing Patterns #

`re.sub()` #

Replaces occurrences of a pattern with a specified string.

pattern = r"\s+"  # Matches one or more whitespace characters
text = "Python    is   fun."
result = re.sub(pattern, " ", text)
print(result)  # Output: 'Python is fun.'
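
The replacement passed to `re.sub()` can also refer to captured groups or be a function. The sketch below shows both forms. The date string and the `double` helper are hypothetical examples.

```python
import re

# \1, \2, \3 in the replacement refer to the captured groups.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2024-01-15")
print(swapped)  # 15/01/2024

def double(match):
    """Hypothetical helper: return the matched number multiplied by two."""
    return str(int(match.group()) * 2)

# A callable receives each match object and returns the replacement text.
doubled = re.sub(r"\d+", double, "12 cats and 8 dogs")
print(doubled)  # 24 cats and 16 dogs
```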

3. Advanced Topics #

3.1 Groups and Capturing #

Groups are created using parentheses () and can capture parts of a match.

pattern = r"(\d{3})-(\d{3})-(\d{4})"  # Matches phone numbers
text = "Call me at 123-456-7890."
match = re.search(pattern, text)
if match:
    print(match.group(0))  # Full match: 123-456-7890
    print(match.group(1))  # First group: 123
    print(match.group(2))  # Second group: 456
    print(match.group(3))  # Third group: 7890
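
Groups can also be named with `(?P<name>...)`, which tends to read better than numeric indexes. A minimal sketch on the same phone number follows. The group names `area`, `prefix`, and `line` are chosen for illustration.

```python
import re

pattern = r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})"
match = re.search(pattern, "Call me at 123-456-7890.")
if match:
    print(match.group("area"))  # 123
    print(match.groupdict())    # {'area': '123', 'prefix': '456', 'line': '7890'}
    print(match.groups())       # ('123', '456', '7890')
```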

3.2 Non-Capturing Groups #

Non-capturing groups use (?:...) and do not capture text for back-references.

pattern = r"(?:Hello|Hi), (\w+)"
text = "Hello, Alice"
match = re.search(pattern, text)
if match:
    print(match.group(1))  # Output: Alice
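
The practical effect of a non-capturing group shows up clearly with `re.findall()`, which returns one item per capturing group. The comparison below is a sketch with a made-up input string.

```python
import re

text = "Hello, Alice. Hi, Bob."

# Two capturing groups: each result is a (greeting, name) tuple.
capturing = re.findall(r"(Hello|Hi), (\w+)", text)
print(capturing)  # [('Hello', 'Alice'), ('Hi', 'Bob')]

# The greeting is grouped but not captured, so only the names are returned.
non_capturing = re.findall(r"(?:Hello|Hi), (\w+)", text)
print(non_capturing)  # ['Alice', 'Bob']
```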

3.3 Lookahead and Lookbehind #

Lookahead and lookbehind assertions match a position based on the text that follows or precedes it, without including that text in the match.

Positive Lookahead #

pattern = r"\w+(?=@example\.com)"
text = "user@example.com"
match = re.search(pattern, text)
print(match.group())  # Output: user

Negative Lookbehind #

pattern = r"(?<![\$\d])\d+"  # Excluding \d as well stops the match from starting mid-number (the "0" in "$50")
text = "Price: $50, Discount: 20"
matches = re.findall(pattern, text)
print(matches)  # Output: ['20']
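
The two remaining assertion types, positive lookbehind `(?<=...)` and negative lookahead `(?!...)`, work the same way. The sketch below uses made-up sample strings.

```python
import re

# Positive lookbehind: digits that are immediately preceded by "$".
prices = re.findall(r"(?<=\$)\d+", "Price: $50, Discount: 20")
print(prices)  # ['50']

# Negative lookahead: file names whose extension is not "txt".
files = "report.txt notes.md data.csv"
not_txt = re.findall(r"\b\w+\.(?!txt\b)\w+", files)
print(not_txt)  # ['notes.md', 'data.csv']
```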

4. Flags in Regex #

Flags modify the behavior of regex patterns.

Flag	Description
`re.IGNORECASE`	Makes the pattern case-insensitive.
`re.MULTILINE`	Makes `^` and `$` match at the start and end of each line, not only of the whole string.
`re.DOTALL`	Allows `.` to match newline characters.
`re.VERBOSE`	Allows whitespace and comments in the pattern.

Example with Flags: #

pattern = r"(?i)hello"  # (?i) is the inline form of re.IGNORECASE
text = "Hello, world!"
match = re.search(pattern, text)
print(bool(match))  # Output: True
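
Flags can also be passed through the `flags` argument and combined with `|`. The sketch below shows the effect of `re.MULTILINE` and `re.DOTALL` on a two-line string. The sample text is made up.

```python
import re

text = "first line\nsecond line"

# Without re.MULTILINE, ^ matches only at the start of the string.
default_starts = re.findall(r"^\w+", text)
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(default_starts)  # ['first']
print(line_starts)     # ['first', 'second']

# Without re.DOTALL, . stops at the newline, so the search fails.
no_dotall = re.search(r"first.*second", text)
dotall = re.search(r"first.*second", text, flags=re.DOTALL)
print(no_dotall is None, dotall is not None)  # True True

# Several flags are combined with the | operator.
combined = re.search(r"FIRST.*SECOND", text, flags=re.IGNORECASE | re.DOTALL)
print(combined is not None)  # True
```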

5. Common Pitfalls #

Greedy vs Non-Greedy Matching:
- Greedy: .* matches as much text as possible.
- Non-Greedy: .*? matches as little text as possible.

text = "<tag>content</tag>"
greedy = re.search(r"<.*>", text)
non_greedy = re.search(r"<.*?>", text)
print(greedy.group())      # Output: <tag>content</tag>
print(non_greedy.group())  # Output: <tag>

Escaping Special Characters: Use \ to escape characters like ., *, +, etc.
Overuse: Avoid complex regex patterns when simpler string methods suffice.
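
For the escaping pitfall above, `re.escape()` can add the backslashes automatically when the text to match comes from a variable. The sketch below uses a made-up literal to show what goes wrong without escaping.

```python
import re

literal = "3.5+2"

# Unescaped, "." means any character and "+" means one or more repetitions,
# so the pattern matches unrelated text and misses the literal itself.
wrong_hit = re.search(literal, "3x55552")
missed = re.search(literal, "3.5+2")
print(wrong_hit is not None, missed is None)  # True True

# re.escape() backslash-escapes the metacharacters.
escaped = re.escape(literal)
print(escaped)  # 3\.5\+2
print(re.search(escaped, "3.5+2") is not None)  # True
print(re.search(escaped, "3x55552") is None)    # True
```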

6. Best Practices #

Test Patterns: Use tools like regex101 to test your patterns.
Keep it Simple: Use comments and re.VERBOSE for readability.
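Optimize Performance: On large inputs, avoid patterns with nested quantifiers such as (a+)+, which can backtrack excessively, and compile patterns that are reused with re.compile().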
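
One way to combine these practices is a pattern compiled once with `re.VERBOSE`, so it can carry its own comments and be reused. The sketch below assumes the phone number format used earlier. The name `PHONE` and the sample text are illustrative.

```python
import re

# Compiled once and reused. re.VERBOSE ignores the whitespace and
# allows the inline comments inside the pattern.
PHONE = re.compile(
    r"""
    (\d{3})   # area code
    -
    (\d{3})   # prefix
    -
    (\d{4})   # line number
    """,
    re.VERBOSE,
)

numbers = [m.group() for m in PHONE.finditer("Call 123-456-7890 or 987-654-3210.")]
print(numbers)  # ['123-456-7890', '987-654-3210']
```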

Conclusion #

Regex is a versatile tool for text processing and pattern matching in Python. While it can handle complex scenarios, understanding its syntax, capabilities, and limitations is essential for effective use.

Bigdata – Knowledge Base