Python Regex Guide

Regular expressions (also known as regex or regexp) are sequences of characters that define search patterns. They let you match patterns in text, so you can search for and extract specific information from a string.

Regular Expressions

In a regular expression, you can use metacharacters to define rules for matching patterns. Metacharacters are special characters and symbols that have a specific meaning. For example, the period “.” matches any single character, and the asterisk “*” matches zero or more occurrences of the element that precedes it.
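As a minimal sketch of these two metacharacters, the snippet below uses Python’s re module, which the rest of this guide relies on. The sample strings are made up for illustration:

```python
import re

# "." stands for any single character (except a newline by default)
dot_matches = re.findall(r"c.t", "cat cot cut ct")

# "*" allows zero or more repetitions of the element just before it
star_matches = re.findall(r"ab*c", "ac abc abbbc adc")

print(dot_matches)   # ['cat', 'cot', 'cut']
print(star_matches)  # ['ac', 'abc', 'abbbc']
```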

Regular expressions are used in a wide range of applications. Some popular ones include:

  1. Data validation: Regular expressions can be used to validate user input, such as validating email addresses, phone numbers, or credit card numbers.
  2. Text parsing: Regular expressions can be used to extract specific information from a text string, such as finding all instances of a particular word or phrase, or extracting data from structured text data formats like CSV files.
  3. Web scraping: Regular expressions can be used to scrape data from websites by searching for specific patterns in the HTML or XML code.
  4. Search and replace: Regular expressions can be used to search for and replace text in a document or file, which can be especially useful when dealing with large amounts of text.
  5. Programming: Regular expressions are widely used in programming languages like Python, JavaScript, and Perl for tasks like string manipulation, text processing, and data analysis.
  6. Command-line tools: Many command-line tools, such as grep and sed in Unix-based systems, support regular expressions for searching and manipulating text files.

Getting Started

To use regular expressions in Python, you first need to import the re module. Here’s an example:

import re

Basic Syntax

A regular expression is a pattern that you want to match in a string. The simplest pattern is literal text: the regular expression cat matches any string that contains the characters “cat” in sequence, including inside longer words such as “concatenate”.

import re

pattern = "cat"
text = "The cat is black and white."

match = re.search(pattern, text)

if match:
    print("Match found!")
else:
    print("Match not found.")

This will output “Match found!” because the word “cat” is present in the text string.
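A literal pattern matches substrings, not whole words. The sketch below illustrates this with a made-up sentence. It also shows two methods of the match object, group() and span(), and the \b word-boundary token for whole-word matching:

```python
import re

text = "Please concatenate the files."

# A plain literal matches anywhere, even inside a longer word
match = re.search(r"cat", text)
print(match.group())  # cat
print(match.span())   # (10, 13)

# \b marks a word boundary, so this only matches "cat" as a whole word
whole_word = re.search(r"\bcat\b", text)
print(whole_word)     # None
```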

Metacharacters

In addition to literal text, regular expressions can use metacharacters to match patterns. Some commonly used metacharacters are:

  • .: matches any single character except a newline
  • ^: matches at the start of the string
  • $: matches at the end of the string (or just before a trailing newline)
  • *: matches zero or more occurrences of the preceding element
  • +: matches one or more occurrences of the preceding element
  • ?: matches zero or one occurrence of the preceding element
  • |: matches either the expression before or after the pipe symbol
  • []: matches any one character within the brackets
  • (): groups expressions together and captures the text they match

Here are some examples:

import re

# match any string that starts with "cat"
pattern = r"^cat"

# match any string that ends with "cat"
pattern = r"cat$"

# match any string that contains "cat" followed by any single character
pattern = r"cat."

# match any string that contains "c", then zero or more "a" characters, then "t"
pattern = r"ca*t"

# match any string that contains "c", then one or more "a" characters, then "t"
pattern = r"ca+t"

# match any string that contains "cat" or "dog"
pattern = r"cat|dog"

# match any string that contains a lowercase vowel
pattern = r"[aeiou]"

# same as "cat|dog", but the parentheses also capture which alternative matched
pattern = r"(cat|dog)"
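To see what patterns like these actually do, a small sketch can run each one against a few sample strings with re.search(). The sample strings and the helper name matching_samples are made up for illustration:

```python
import re

samples = ["cat", "cats", "concat", "ct", "caaat", "hot dog", "xyz"]
patterns = [r"^cat", r"cat$", r"cat.", r"ca*t", r"ca+t", r"cat|dog", r"[aeiou]"]

def matching_samples(pattern, strings):
    """Return the strings in which the pattern is found somewhere."""
    return [s for s in strings if re.search(pattern, s)]

# Map each pattern to the list of samples it matches
results = {p: matching_samples(p, samples) for p in patterns}

for pattern, matched in results.items():
    print(f"{pattern!r:12} -> {matched}")
```

Note that “ct” matches ca*t but not ca+t, because + requires at least one “a”.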

Using re.search() and re.findall()

The re.search() function returns a match object if the pattern is found in the string, or None if the pattern is not found.

import re

pattern = "cat"
text = "The cat is black and white."

match = re.search(pattern, text)

if match:
    print("Match found!")
else:
    print("Match not found.")

The re.findall() function returns a list of all non-overlapping matches of the pattern in the string.

import re

pattern = "cat"
text = "The cat is black and white. Another cat is sleeping."

matches = re.findall(pattern, text)

print(matches)

This will output ['cat', 'cat'] because there are two occurrences of the word “cat” in the text string.
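The return value of re.findall() changes when the pattern contains capturing groups. The sketch below illustrates this and also shows re.finditer(), which yields match objects instead of strings. The patterns are made up for illustration:

```python
import re

text = "The cat is black and white. Another cat is sleeping."

# With one capturing group, findall() returns the group's text, not the whole match
following_words = re.findall(r"cat (\w+)", text)
print(following_words)  # ['is', 'is']

# With several groups, each match becomes a tuple of the groups
pairs = re.findall(r"(\w+) (cat)", text)
print(pairs)  # [('The', 'cat'), ('Another', 'cat')]

# finditer() yields match objects, which also carry positions
positions = [m.start() for m in re.finditer(r"cat", text)]
print(positions)  # [4, 36]
```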

Using re.sub()

The re.sub() function can be used to replace parts of a string that match a pattern with a new string.

import re

pattern = "cat"
text = "The cat is black and white. Another cat is sleeping."

new_text = re.sub(pattern, "dog", text)

print(new_text)

This will output “The dog is black and white. Another dog is sleeping.”
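re.sub() accepts a few more options that are often useful. The sketch below shows three of them: the count argument, backreferences such as \1 in the replacement string, and a function used as the replacement. The specific patterns are made up for illustration:

```python
import re

text = "The cat is black and white. Another cat is sleeping."

# count limits how many matches are replaced
first_only = re.sub(r"cat", "dog", text, count=1)
print(first_only)  # The dog is black and white. Another cat is sleeping.

# \1 and \2 in the replacement refer back to the capturing groups
swapped = re.sub(r"(black) and (white)", r"\2 and \1", text)
print(swapped)     # The cat is white and black. Another cat is sleeping.

# The replacement can also be a function that receives each match object
shouted = re.sub(r"cat", lambda m: m.group().upper(), text)
print(shouted)     # The CAT is black and white. Another CAT is sleeping.
```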

Other Python Regex examples

1. Find words between two strings in a sentence

Here’s an example code snippet that shows how to find the text between two words in a sentence using lookaround assertions:

import re

sentence = "The quick brown fox jumps over the lazy dog."
start_word = "quick"
end_word = "dog"

pattern = r'(?<=' + re.escape(start_word) + r')\s+(.*?)\s+(?=' + re.escape(end_word) + r')'

matches = re.findall(pattern, sentence)

print(matches)

In this example, we define a regular expression pattern that matches the text that is preceded by the start_word and followed by the end_word, with one or more whitespace characters on either side. The pattern uses a positive lookbehind (?<=start_word) to ensure that the start word is present before the match, and a positive lookahead (?=end_word) to ensure that the end word is present after the match. The group (.*?) captures as few characters as possible between them, and re.escape() ensures that any special characters in the two words are matched literally.

The re.findall() function is used to find all occurrences of the pattern in the sentence string. Because the pattern contains one capturing group, it returns a list with the text captured by that group for each match. In this case, the output would be ['brown fox jumps over the lazy'], which is the string between “quick” and “dog”.
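The same idea can be wrapped in a small reusable function. The sketch below is one possible way to do it. The helper name text_between is hypothetical, and it uses a lazy capturing group together with re.escape() so that the delimiters are treated as literal text:

```python
import re

def text_between(start, end, text):
    """Return every stretch of text found between `start` and `end` (hypothetical helper)."""
    # re.escape() makes characters such as "." or "(" in the delimiters match literally
    pattern = r"(?<=" + re.escape(start) + r")\s*(.*?)\s*(?=" + re.escape(end) + r")"
    return re.findall(pattern, text)

print(text_between("quick", "dog", "The quick brown fox jumps over the lazy dog."))
# ['brown fox jumps over the lazy']

print(text_between("(a)", "(b)", "x (a) middle part (b) y"))
# ['middle part']
```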

Note that we use the r prefix before the regular expression pattern to indicate that it is a raw string, which allows us to use backslashes without them being interpreted as escape characters.
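A short sketch of why the r prefix matters: in a normal string literal Python itself interprets \b as a backspace character, so the regex engine never sees the word-boundary token. The sample text is made up for illustration:

```python
import re

# In a normal string literal, "\b" is the backspace character (one character)
print(len("\b"))   # 1
# In a raw string, r"\b" stays a backslash followed by "b" (two characters)
print(len(r"\b"))  # 2

text = "cat concat"
raw_hits = re.findall(r"\bcat\b", text)   # word-boundary token reaches the regex engine
plain_hits = re.findall("\bcat\b", text)  # looks for literal backspace characters instead
print(raw_hits)    # ['cat']
print(plain_hits)  # []
```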

2. Extract data from structured text formats like CSV files

Here’s an example code snippet that shows how to extract data from a CSV file using regular expressions:

import re

# Define the regular expression pattern for one field in a CSV row:
# a double-quoted string (which may contain commas) or a run of non-comma characters
pattern = r'(?:^|,)("[^"]*"|[^,]*)'

# Open the CSV file and process it one row at a time
with open('data.csv') as file:
    for line in file:
        # Use re.findall() to collect the fields of this row
        fields = re.findall(pattern, line.rstrip('\n'))

        # Print the row's fields to the console
        print(fields)

In this example, we define a regular expression pattern that matches one field of a CSV row. The non-capturing group (?:^|,) anchors each field at the start of the line or after a comma. The capturing group then matches either a string enclosed in double quotes (i.e. "[^"]*"), which may itself contain commas, or zero or more characters that are not a comma (i.e. [^,]*), so empty fields are matched too.

We then open the CSV file and read it line by line, and use the re.findall() function to find all occurrences of the pattern in each line. Because the pattern has a single capturing group, findall() returns a list of strings, one per field. Quoted fields keep their surrounding double quotes.

Finally, we print each list to the console, so the output is one list per row in the CSV file. Note that this pattern is a simplification: it does not handle escaped quotes ("") or quoted fields that span several lines, both of which Python’s built-in csv module supports.
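For comparison, Python’s standard csv module handles quoting and empty fields without a hand-written pattern. The following is a minimal sketch of that alternative technique, not a regex solution. It uses an in-memory string with made-up data in place of data.csv:

```python
import csv
import io

# In-memory stand-in for a file such as data.csv (made-up sample data)
sample = 'name,comment\nAlice,"likes cats, dogs"\nBob,\n'

# csv.reader splits each row into a list of field strings
rows = list(csv.reader(io.StringIO(sample)))
for row in rows:
    print(row)
```

Here the quoted field keeps its embedded comma and loses its surrounding quotes, and the empty field after “Bob,” comes back as an empty string.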

Further readings

If you are interested in learning more about regular expressions, here are some resources that can help you:

  1. Regular-Expressions.info: This website provides a comprehensive tutorial on regular expressions, including a quick-start guide, reference material, and examples in various programming languages.
  2. Mastering Regular Expressions: This book by Jeffrey Friedl is a comprehensive guide to regular expressions, covering everything from the basics to advanced topics like lookahead and backreferences.
  3. Python’s re module documentation: The official documentation for Python’s re module provides a detailed reference for all the regular expression functions and metacharacters available in Python.
  4. Regular Expressions Cookbook: This book by Jan Goyvaerts and Steven Levithan provides practical solutions for common regular expression tasks, with examples in multiple programming languages.
  5. regex101.com (Regular Expressions 101): This website allows you to test regular expressions and see how they match against sample text. It supports several regex flavors, including Python’s, and provides detailed explanations of the regular expressions used as well as a quick reference guide for the most commonly used metacharacters.

These resources can help you get started with regular expressions and improve your skills over time.
