If you have thousands of URLs to scrape, a simple sequential Python script would be time-consuming.
The solution is to build a multithreaded Python script capable of scraping multiple URLs simultaneously, thereby completing the task far more efficiently.
In this article, we will guide you through building a simple multithreaded scraper/crawler in Python with the BeautifulSoup library to scrape books from the https://books.toscrape.com/ website.
Which libraries will we need?
BeautifulSoup: BeautifulSoup is a Python library that makes it easy to scrape information from web pages. It sits on top of an HTML or XML parser, providing Pythonic ways of iterating, searching, and modifying the parse tree. To install this library, type the following command in the IDE/terminal.
pip install beautifulsoup4
requests: Requests is a simple, yet elegant, HTTP library. This library allows you to send HTTP/1.1 requests very easily. Install this library using the following command.
pip install requests
You also need to install the lxml parser, which will be used to parse the HTML.
pip install lxml
A stepwise implementation of the script
Follow these steps to build a multithreaded scraping script with BeautifulSoup and Python.
- Import all required libraries
- Create the main program and object of the class MultiThreadedScraper
- MultiThreadedScraper class implementation
- Scrape all URLs in the given source URL
- Create scrape_pages function which will scrape the required data
- Call ThreadPoolExecutor to run the scraper
Step 1: Import all required libraries
We first need to import all the libraries required to build the script. If you're using Python 3, the standard-library modules (re, concurrent.futures, urllib.parse) are already available; only BeautifulSoup, requests, and lxml must be installed separately. If you haven't installed these three libraries yet, install them using the commands given above.
import re
import requests
from bs4 import BeautifulSoup
import concurrent.futures
from urllib.parse import urljoin, urlparse
Step 2: Create the main program and object of the class MultiThreadedScraper
Next, create the main block with if __name__ == '__main__':. This block tells Python where execution starts. Inside it, create an object of the class MultiThreadedScraper and pass the URL of the website you want to scrape.
After creating the class object, call the function scrape_urls(), which scrapes all the book URLs on the given page. Basically, this function is just a URL scraper, which we will need when we implement the actual scraping function.
After scraping all URLs, call the function start_scraper() to scrape the title, price, and stock quantity of each book in a multithreaded way.
if __name__ == '__main__':
    cc = MultiThreadedScraper("https://books.toscrape.com/")
    cc.scrape_urls()
    cc.start_scraper()
Step 3: MultiThreadedScraper class implementation
Let’s implement the MultiThreadedScraper class with the required data members.
class MultiThreadedScraper:

    def __init__(self, source_url):
        self.source_url = source_url
        self.root_url = (
            f'{urlparse(self.source_url).scheme}:'
            f'//{urlparse(self.source_url).netloc}'
        )
        self.all_urls = set()
        self.total_threads = 10
In the above class implementation, we have the following data members:
- source_url: the URL passed from the main block. It is the page that contains the list of books to be extracted.
- root_url: the root URL (scheme and host) extracted from source_url.
- all_urls: the set of book URLs scraped from source_url. We use a set because it automatically removes duplicate URLs.
- total_threads: the number of threads that will run concurrently. We use 10 threads, but you can increase this according to your needs.
Step 4: Scrape all URLs in the given source URL
def scrape_urls(self):
    """Scrape all book urls present in given home page url"""
    page_data = requests.get(self.source_url).content
    soup = BeautifulSoup(page_data, 'lxml')
    all_urls = soup.select('article.product_pod h3 a')
    # Extract only book urls
    for link in all_urls:
        url = urljoin(self.root_url, link['href'])
        self.all_urls.add(url)
In the above code, we first fetch source_url using the requests library, get its HTML content, and save it in the page_data variable. We then create a BeautifulSoup object from page_data.
After that, we select all the book link elements using the select method of BeautifulSoup and store them in all_urls. Because this variable holds HTML elements rather than plain URLs, we loop through them, extract each href value, resolve it against root_url, and add the result to the all_urls set.
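The book links on the listing page are relative (for example, catalogue/sharp-objects_997/index.html, as seen in the output later), which is why urljoin is needed. A minimal sketch of how the root URL is derived and a relative href is resolved:

```python
from urllib.parse import urljoin, urlparse

source_url = "https://books.toscrape.com/"
# Rebuild the root URL from scheme and host, as the class constructor does.
root_url = f"{urlparse(source_url).scheme}://{urlparse(source_url).netloc}"

# A relative href scraped from a listing-page link element.
relative_href = "catalogue/sharp-objects_997/index.html"
print(urljoin(root_url, relative_href))
# https://books.toscrape.com/catalogue/sharp-objects_997/index.html
```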
Step 5: Create the scrape_pages function which will scrape the required data
def scrape_pages(self, page_url):
    page_data = requests.get(page_url).content
    soup = BeautifulSoup(page_data, 'lxml')
    # Scrape book title
    title = soup.select('h1')[0].text.strip()
    # Scrape price
    price = soup.select('p.price_color')[0].text.strip()
    # Scrape total stock
    quantity = soup.select('p.instock')[0].text.strip()
    match = re.search(r'\b(\d+)\b', quantity)
    if match:
        quantity = int(match.group(1))  # Extract number as integer
    else:
        quantity = 0
    print(
        f'URL: {page_url}, Title: {title}, '
        f'Price: {price}, Quantity: {quantity}'
    )
In the above code, we pass page_url, which points to a single book page containing all the required data. We fetch the given URL using the requests library and use BeautifulSoup to scrape the title, price, and stock text; a regular expression then extracts the numeric quantity from the stock text.
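The stock text on a book page reads something like "In stock (19 available)", so the regular expression simply captures the first run of digits. A small sketch of that extraction in isolation:

```python
import re

# Example stock text as it appears on a book page (illustrative input).
quantity_text = "In stock (19 available)"

# \b(\d+)\b captures the first whole number in the string.
match = re.search(r'\b(\d+)\b', quantity_text)
quantity = int(match.group(1)) if match else 0
print(quantity)  # 19
```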
Step 6: Call ThreadPoolExecutor to run the scraper
def start_scraper(self):
    """Begin concurrent scraper"""
    with concurrent.futures.ThreadPoolExecutor(self.total_threads) as executor:
        executor.map(self.scrape_pages, self.all_urls)
Finally, create the function start_scraper, which creates a ThreadPoolExecutor and maps scrape_pages over all the URLs we scraped earlier, so up to total_threads pages are fetched at the same time.
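To see the mechanism in isolation, here is a minimal, self-contained sketch (not part of the scraper; the work function is a stand-in for scrape_pages) showing how executor.map fans work out across a pool of threads:

```python
import concurrent.futures

# Stand-in for scrape_pages: any function taking one item from the iterable.
def work(n):
    return n * n

# Each input is handed to a worker thread; map preserves input order.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(work, [1, 2, 3, 4]))

print(results)  # [1, 4, 9, 16]
```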
The complete code is given below.
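Assembled from the snippets above, the complete script reads as follows. Note that running it makes live HTTP requests to https://books.toscrape.com/.

```python
import re
import requests
from bs4 import BeautifulSoup
import concurrent.futures
from urllib.parse import urljoin, urlparse


class MultiThreadedScraper:

    def __init__(self, source_url):
        self.source_url = source_url
        self.root_url = (
            f'{urlparse(self.source_url).scheme}:'
            f'//{urlparse(self.source_url).netloc}'
        )
        self.all_urls = set()
        self.total_threads = 10

    def scrape_urls(self):
        """Scrape all book urls present in given home page url"""
        page_data = requests.get(self.source_url).content
        soup = BeautifulSoup(page_data, 'lxml')
        for link in soup.select('article.product_pod h3 a'):
            self.all_urls.add(urljoin(self.root_url, link['href']))

    def scrape_pages(self, page_url):
        """Scrape title, price and stock quantity from one book page"""
        page_data = requests.get(page_url).content
        soup = BeautifulSoup(page_data, 'lxml')
        title = soup.select('h1')[0].text.strip()
        price = soup.select('p.price_color')[0].text.strip()
        quantity = soup.select('p.instock')[0].text.strip()
        match = re.search(r'\b(\d+)\b', quantity)
        quantity = int(match.group(1)) if match else 0
        print(
            f'URL: {page_url}, Title: {title}, '
            f'Price: {price}, Quantity: {quantity}'
        )

    def start_scraper(self):
        """Begin concurrent scraper"""
        with concurrent.futures.ThreadPoolExecutor(self.total_threads) as executor:
            executor.map(self.scrape_pages, self.all_urls)


if __name__ == '__main__':
    cc = MultiThreadedScraper("https://books.toscrape.com/")
    cc.scrape_urls()
    cc.start_scraper()
```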
When you run the above code, the output will look like the following:
URL: https://books.toscrape.com/catalogue/set-me-free_988/index.html, Title: Set Me Free, Price: £17.46, Quantity: 19
URL: https://books.toscrape.com/catalogue/sharp-objects_997/index.html, Title: Sharp Objects, Price: £47.82, Quantity: 20
URL: https://books.toscrape.com/catalogue/the-requiem-red_995/index.html, Title: The Requiem Red, Price: £22.65, Quantity: 19
URL: https://books.toscrape.com/catalogue/olio_984/index.html, Title: Olio, Price: £23.88, Quantity: 19
Related: You should also check the How to Scrape Email and Phone Number from Any Website with Python tutorial.