Extracting emails from text is a common task in Python, especially when you are cleaning data or building a list of email addresses from a text document.
In this tutorial, I will show you how to extract emails from text in Python using regular expressions (RegEx).
Extracting Emails from Text using RegEx
To extract emails from text in Python using RegEx, we'll use the re module, which provides support for regular expressions. Here's a simple example:
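    import re

    def extract_emails(text):
        # [A-Za-z]{2,} matches the top-level domain; a "|" inside a
        # character class would be a literal pipe, so it is left out.
        email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        # findall returns every non-overlapping match as a list of strings
        return re.findall(email_regex, text)

    sample_text = "Please contact me at dminhvu.work@gmail.com or wisecode@gmail.com."
    emails = extract_emails(sample_text)
    print(emails)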
This code snippet defines a function extract_emails that:
- takes a string text as input,
- returns a list of the email addresses found in that string, based on the RegEx pattern \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b.
If you save the code as extract_emails.py, you can run it with the python extract_emails.py command in the Terminal (or Command Prompt on Windows).
To understand how this email_regex pattern is constructed, you can learn more here; that page has an explanation section on the right-hand side.
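As a rough sketch of how the pattern fits together, it can be written with re.VERBOSE so that each part carries its own comment. The name EMAIL_PATTERN is an assumption made for illustration, and the top-level domain class is written as [A-Za-z] because a | inside a character class is a literal pipe character, not alternation.

```python
import re

# The email pattern split into commented parts (whitespace and comments
# are ignored by the regex engine because of re.VERBOSE).
EMAIL_PATTERN = re.compile(r"""
    \b                   # word boundary: the match starts on a whole token
    [A-Za-z0-9._%+-]+    # local part: letters, digits, and . _ % + -
    @                    # the literal at sign
    [A-Za-z0-9.-]+       # domain labels: letters, digits, dots, hyphens
    \.                   # the dot before the top-level domain
    [A-Za-z]{2,}         # top-level domain: at least two letters
    \b                   # word boundary: the match ends on a whole token
""", re.VERBOSE)

sample_text = "Please contact me at dminhvu.work@gmail.com or wisecode@gmail.com."
print(EMAIL_PATTERN.findall(sample_text))
```

This pattern is a practical approximation rather than a full validator: it accepts most everyday addresses but does not try to cover every form an email address can legally take.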
Extracting Emails from a Text File
To extract emails from a text file, we'll read the file's content into a string and then use the same extract_emails function defined earlier.
    # ...

    def extract_emails_from_file(file_path):
        # Read the whole file into one string, then reuse extract_emails
        with open(file_path, 'r') as file:
            content = file.read()
        return extract_emails(content)

    file_path = 'example.txt'
    emails = extract_emails_from_file(file_path)
    print(emails)
In this example, extract_emails_from_file reads the entire content of the file located at file_path and then uses the extract_emails function to find all email addresses.
The full code will be:
    import re

    def extract_emails(text):
        # [A-Za-z]{2,} matches the top-level domain (no "|" inside the class)
        email_regex = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        return re.findall(email_regex, text)

    def extract_emails_from_file(file_path):
        # Read the whole file into one string, then reuse extract_emails
        with open(file_path, 'r') as file:
            content = file.read()
        return extract_emails(content)

    file_path = 'example.txt'
    emails = extract_emails_from_file(file_path)
    print(emails)
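To try the file-based version without preparing example.txt by hand, here is a minimal sketch that writes a throwaway file first and then extracts from it. The sample addresses, the EMAIL_REGEX constant, and the encoding parameter are assumptions added for illustration; in particular, it assumes the file is UTF-8 text.

```python
import os
import re
import tempfile

EMAIL_REGEX = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

def extract_emails(text):
    return re.findall(EMAIL_REGEX, text)

def extract_emails_from_file(file_path, encoding='utf-8'):
    # Assumption: the file is UTF-8 text; pass another encoding if it is not.
    with open(file_path, 'r', encoding=encoding) as file:
        return extract_emails(file.read())

# Write a small temporary file so the example runs on its own.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("Sales: sales@example.com\nSupport: support@example.org\n")
    tmp_path = tmp.name

try:
    emails = extract_emails_from_file(tmp_path)
    print(emails)
finally:
    # Clean up the temporary file whether or not extraction succeeded.
    os.remove(tmp_path)
```

Passing the encoding explicitly avoids depending on the platform default, which can differ between operating systems.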
Handling Large Text Files
When dealing with large text files, reading the entire file into memory might not be feasible. In such cases, we can process the file line by line:
    # ...

    def extract_emails_from_large_file(file_path):
        emails = []
        with open(file_path, 'r') as file:
            # Iterating over the file object yields one line at a time
            for line in file:
                emails.extend(extract_emails(line))
        return emails

    file_path = 'large_example.txt'
    emails = extract_emails_from_large_file(file_path)
    print(emails)
This function iterates over each line in the file, extracts emails from that line, and appends them to the emails list. This approach is more memory-efficient for large files because only one line of text is held in memory at a time.
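If the list of results could itself grow large, one possible variation is a generator that yields addresses one at a time instead of collecting them. This is only a sketch: iter_emails and EMAIL_PATTERN are hypothetical names, and io.StringIO stands in for a real file so the example runs without any file on disk.

```python
import io
import re

# Compile once so the pattern is not re-parsed for every line.
EMAIL_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

def iter_emails(lines):
    """Yield addresses lazily from any iterable of lines (e.g. an open file)."""
    for line in lines:
        yield from EMAIL_PATTERN.findall(line)

# io.StringIO stands in for a large file here. With a real file you could write:
#     with open('large_example.txt', 'r') as file:
#         for email in iter_emails(file):
#             ...
fake_file = io.StringIO(
    "alice@example.com wrote to bob@example.org\n"
    "no address on this line\n"
    "carol@example.net\n"
)
found = list(iter_emails(fake_file))
print(found)
```

Because the function accepts any iterable of strings, the same code works with an open file, a list of lines, or another generator.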
Conclusion
In this tutorial, we've learned how to extract email addresses from strings and text files using Python.
In general, to extract email addresses from strings in Python:
- Use the re package to extract emails,
- with the \b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b RegEx pattern to match the email pattern.
If you find that the pattern does not work, please comment below so I can fix it. Thank you!