Do you have a large number of website URLs and want to scrape all emails and phone numbers from them?
This tutorial will guide you through scraping emails and phone numbers with a Python script built on the requests library and regular expressions.
Scraping contact details from a single website can be done manually without much difficulty. However, when there are a large number of websites to scrape, the process becomes time-consuming. A solution to this problem is to use a Python script.
A Python script can work through a long list of URLs automatically, which saves a significant amount of time and effort compared to manual scraping.
The script will use regular expressions to identify patterns that match the structure of email addresses and phone numbers. Whenever a pattern matches, the script extracts the matching text from the page. By the end of this tutorial, you will have a working script that scrapes email addresses and phone numbers from a list of websites.
So, let’s begin!
Table of Contents
- Create a Python file
- Build Regular Expressions for Phone Numbers
- Build Regular Expression for Email ID
- Source code to read the URL from the CSV file
- Source Code to find all phone numbers and emails
- Running the program
Step 1: Create a Python file
First, create a new Python file called email_phone_scrap.py. Then import the libraries that your program will need. Your program should look like the code block below:
# email_phone_scrap.py - Scrape emails and phone numbers from given websites.
import csv # for reading/writing in CSV file
import re # for regular expressions
import requests # for opening web page
Step 2: Build Regular Expressions for Phone Numbers
To search for phone numbers, you will need to create a regular expression.
First, let’s understand the structure of a typical phone number. It consists of three parts: the area code (usually three digits), the first three digits, and the last four digits. These parts are typically separated by a symbol, such as a hyphen or a space, for example, 122-456-7890. Some numbers also end with an extension.
To create a regular expression for this pattern, break it down into the following parts:
Phone Number Pattern | Regular Expression |
---|---|
Area code (optional) | (\d{3}|\(\d{3}\))? |
Separator: space, hyphen, or dot (optional) | (\s|-|\.)? |
First 3 digits | (\d{3}) |
Separator: space, hyphen, or dot | (\s|-|\.) |
Last 4 digits | (\d{4}) |
Extension (optional) | (\s*(ext|x|ext\.)\s*(\d{2,5}))? |
Let’s put it together to create a regex for the phone number.
# email_phone_scrap.py - Scrape emails and phone numbers from given websites.
import csv # for reading/writing in CSV file
--snip--
# Create phone number regular expression
phone_regex = re.compile(r'''(
(\d{3}|\(\d{3}\))?
(\s|-|\.)?
(\d{3})
(\s|-|\.)
(\d{4})
(\s*(ext|x|ext\.)\s*(\d{2,5}))?)''', re.VERBOSE)
Note: The re.VERBOSE flag lets you spread a regular expression over several lines and add comments to it. Whitespace inside the pattern is ignored.
If this code is hard to follow, it may help to review the basics of Python and its regular expressions first.
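To see which tuple positions hold which part of a phone number, here is a small sketch that runs the pattern against a sample string. The sample text is made up for illustration, the inline comments are additions, and the dot in `ext\.` is escaped here so that it matches a literal period.

```python
import re

# Phone pattern from this step, with a comment on each part.
phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?               # area code (optional)
    (\s|-|\.)?                       # separator (optional)
    (\d{3})                          # first 3 digits
    (\s|-|\.)                        # separator
    (\d{4})                          # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # extension (optional)
    )''', re.VERBOSE)

# Hypothetical sample text, not taken from a real page.
sample = "Call 415-863-9900 or (800) 420-7240 ext 123."

for groups in phone_regex.findall(sample):
    # findall returns one tuple per match; index 0 is the whole match.
    area_code, first3, last4, extension = groups[1], groups[3], groups[5], groups[8]
    print(repr(area_code), first3, last4, repr(extension))
# prints:
# '415' 863 9900 ''
# '(800)' 420 7240 '123'
```

Because every pair of parentheses creates a group, the area code ends up at index 1, the two digit groups at indexes 3 and 5, and the extension digits at index 8.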
Step 3: Build Regular Expression for Email ID
Next, let’s move on to creating a regular expression to match the Email ID pattern.
When creating a regular expression for email addresses, you need to consider the different parts of an email address. Typically, an email address consists of four main components: the username, an @ symbol, the domain name, and a suffix (such as .com or .edu). For instance, an example email address could be contact@vpktechnologies.com.
Email Pattern | Regular Expression |
---|---|
User name | [a-zA-Z0-9._%+-]+ |
@ symbol | @ |
Domain name | [a-zA-Z0-9.-]+ |
Dot and something | (\.[a-zA-Z]{2,4}) |
Let’s put it together to create a regex for email id.
# email_phone_scrap.py - Scrape emails and phone numbers from given websites.
import csv # for reading/writing in CSV file
--snip--
# Create phone number regular expression
phone_regex = re.compile(r'''(
--snip--
# Create email id regular expression
email_regex = re.compile(r'''(
[a-zA-Z0-9._%+-]+
@
[a-zA-Z0-9.-]+
(\.[a-zA-Z]{2,4}))''', re.VERBOSE)
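As a quick check of the email pattern, the following sketch runs it against a short sample sentence. The sentence and the second address are invented for illustration, and the inline comments are additions.

```python
import re

# Email pattern from this step, with a comment on each part.
email_regex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # user name
    @                      # @ symbol
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # dot and something
    )''', re.VERBOSE)

# Hypothetical sample text.
sample = "Write to contact@vpktechnologies.com or sales@example.org today."

# Index 0 of each tuple is the whole address.
emails = [groups[0] for groups in email_regex.findall(sample)]
print(emails)
# prints: ['contact@vpktechnologies.com', 'sales@example.org']
```

Note that `{2,4}` limits the suffix to two to four letters, so an address with a longer suffix would be cut short or missed. Widening that range is one possible adjustment.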
Step 4: Source code to read the URL from the CSV file
Create a new CSV file named website_urls.csv and put all website URLs in column A, one per row, including the scheme (for example, https://). Store this CSV file in the same directory as email_phone_scrap.py.
Next, write the Python code that reads each URL from the CSV file. Add the following code to email_phone_scrap.py.
# email_phone_scrap.py - Scrape emails and phone numbers from given websites.
import csv # for reading/writing in CSV file
--snip--
# Create phone number regular expression
phone_regex = re.compile(r'''(
--snip--
# Create email id regular expression
email_regex = re.compile(r'''(
--snip--
# Open URLs from CSV file and fetch each page
with open("website_urls.csv") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        page_url = row[0]
        print("Opening URL:", page_url)
        page_data = requests.get(page_url)
        # Get the decoded page content as a string
        page_html = page_data.text
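One thing to watch for when reading the CSV file: `csv.reader` yields an empty list for a blank line, so `row[0]` would raise an `IndexError` there. The sketch below shows one way to skip such rows. The helper name `urls_from_rows` and the sample lines are hypothetical and not part of the tutorial script.

```python
import csv


def urls_from_rows(lines):
    """Return the non-empty values found in column A of CSV-formatted lines."""
    urls = []
    for row in csv.reader(lines):
        # Blank lines produce an empty row; skip them and whitespace-only cells.
        if row and row[0].strip():
            urls.append(row[0].strip())
    return urls


# Hypothetical file content: one blank line and one row with an extra column.
sample_lines = [
    "https://nostarch.com/contactus/\n",
    "\n",
    "https://example.com/,notes\n",
]
print(urls_from_rows(sample_lines))
# prints: ['https://nostarch.com/contactus/', 'https://example.com/']
```

With a real file, the same helper could be called as `urls_from_rows(csv_file)` inside the `with open("website_urls.csv", newline='') as csv_file:` block. When fetching many pages, it may also be worth passing a `timeout` argument to `requests.get` so that one slow site does not stall the whole run.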
Step 5: Source code to find all phone numbers and emails
The script opens the webpages listed in the CSV file one by one. To find all phone numbers and emails on each page, apply the regular expressions created above to the page's HTML.
# email_phone_scrap.py - Scrape emails and phone numbers from given websites.
import csv # for reading/writing in CSV file
--snip--
# Create phone number regular expression
phone_regex = re.compile(r'''(
--snip--
# Create email id regular expression
email_regex = re.compile(r'''(
--snip--
# Open URLs from CSV file and fetch each page
with open("website_urls.csv") as csv_file:
    --snip--
        matches = []
        for groups in phone_regex.findall(page_html):
            phone_numbers = '-'.join([groups[1], groups[3], groups[5]])
            if groups[8] != '':
                phone_numbers += ' x' + groups[8]
            matches.append(phone_numbers)
        for groups in email_regex.findall(page_html):
            matches.append(groups[0])
        print('\n'.join(matches))
The complete source code should look like this.
# email_phone_scrap.py - Scrape emails and phone numbers from given websites.
import csv # for reading/writing in CSV file
import re # for regular expressions
import requests # for opening web page
# Create phone number regular expression
phone_regex = re.compile(r'''(
(\d{3}|\(\d{3}\))?
(\s|-|\.)?
(\d{3})
(\s|-|\.)
(\d{4})
(\s*(ext|x|ext\.)\s*(\d{2,5}))?)''', re.VERBOSE)
# Create email id regular expression
email_regex = re.compile(r'''(
[a-zA-Z0-9._%+-]+
@
[a-zA-Z0-9.-]+
(\.[a-zA-Z]{2,4}))''', re.VERBOSE)
# Open URLs from CSV file and fetch each page
with open("website_urls.csv") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        page_url = row[0]
        print("Opening URL:", page_url)
        page_data = requests.get(page_url)
        # Get the decoded page content as a string
        page_html = page_data.text
        matches = []
        for groups in phone_regex.findall(page_html):
            phone_numbers = '-'.join([groups[1], groups[3], groups[5]])
            if groups[8] != '':
                phone_numbers += ' x' + groups[8]
            matches.append(phone_numbers)
        for groups in email_regex.findall(page_html):
            matches.append(groups[0])
        print('\n'.join(matches))
Step 6: Running the program
To make things simpler, we will only include one URL in the CSV file for this example. When you run the program, the output should look similar to the following:
Opening URL: https://nostarch.com/contactus/
-010-4093
800-420-7240
415-863-9900
415-863-9950
support@nostarch.com
academic@nostarch.com
sales@nostarch.com
conferences@nostarch.com
errata@nostarch.com
support@nostarch.com
academic@nostarch.com
sales@nostarch.com
conferences@nostarch.com
errata@nostarch.com
info@nostarch.com
media@nostarch.com
editors@nostarch.com
rights@nostarch.com
support@nostarch.com
academic@nostarch.com
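This output shows the phone numbers and email addresses that were found on the webpage. The first line, -010-4093, is a seven-digit match with no area code: the script joins the empty area code to the other two groups, which leaves a leading hyphen. Some addresses appear more than once because findall returns every occurrence in the page source. If you had multiple URLs in the CSV file, the program would scrape each of them and display the results in the same way.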
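If the repeated addresses and the leading hyphen in the sample output are unwanted, the extraction step could be wrapped in a helper along these lines. This is only a sketch: the function name `extract_contacts` and the sample HTML are made up for illustration, and the regexes repeat the ones from Steps 2 and 3 (with the dot in `ext\.` escaped).

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?
    )''', re.VERBOSE)

email_regex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+
    @
    [a-zA-Z0-9.-]+
    (\.[a-zA-Z]{2,4})
    )''', re.VERBOSE)


def extract_contacts(page_html):
    """Return unique phone numbers and emails in order of first appearance."""
    matches = []
    for groups in phone_regex.findall(page_html):
        # Join only the parts that matched, so a missing area code
        # does not leave a leading hyphen.
        parts = [part for part in (groups[1], groups[3], groups[5]) if part]
        phone_number = '-'.join(parts)
        if groups[8]:
            phone_number += ' x' + groups[8]
        matches.append(phone_number)
    for groups in email_regex.findall(page_html):
        matches.append(groups[0])
    # dict.fromkeys keeps the first occurrence of each value, in order.
    return list(dict.fromkeys(matches))


# Hypothetical page fragment with a repeated number and address.
sample_html = (
    "<p>Sales: 800-420-7240, sales@nostarch.com</p>"
    "<p>Office: 863-9900</p>"
    "<footer>sales@nostarch.com | 800-420-7240</footer>"
)
print('\n'.join(extract_contacts(sample_html)))
# prints:
# 800-420-7240
# 863-9900
# sales@nostarch.com
```

Removing duplicates with `dict.fromkeys` rather than `set` keeps the results in the order they appear on the page.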
Related: See our guide on How to Build a Multi-Threaded Web Scraper in Python if you want to create a multi-threaded scraper in Python.
Conclusion
This script extracts phone numbers and email addresses from every URL listed in the CSV file, so you can add as many URLs as you like. The phone number pattern accepts a space, hyphen, or dot as a separator, with or without an area code. A separator between the last two digit groups is required, though, so a number written with no separators at all (such as 4158639900) will not match unless you adjust the pattern.
To keep things simple, the extracted data is printed to the console. The script can be modified to save the scraped data to a CSV file or another file format.
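As one possible way to save results to a CSV file instead of printing them, the sketch below writes one row per match. The helper name `write_matches`, the three-column layout, and the sample rows are assumptions made for illustration. An in-memory buffer stands in for a real file so the example runs without touching the disk.

```python
import csv
import io


def write_matches(out_file, rows):
    """Write (url, type, value) rows to an open text file, header first."""
    writer = csv.writer(out_file)
    writer.writerow(['url', 'type', 'value'])
    writer.writerows(rows)


# Hypothetical results collected while scraping.
rows = [
    ('https://nostarch.com/contactus/', 'phone', '800-420-7240'),
    ('https://nostarch.com/contactus/', 'email', 'info@nostarch.com'),
]

buffer = io.StringIO()
write_matches(buffer, rows)
print(buffer.getvalue())
```

To write to disk, the same helper could be called inside `with open("results.csv", "w", newline='') as out_file:`, where the file name is again just an example.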