If you have thousands of URLs to scrape, a simple sequential Python script would be time-consuming.
The solution is to build a multithreaded Python script capable of scraping multiple URLs simultaneously, thereby completing the task far more efficiently.
In this article, we will guide you through building a simple multithreaded scraper/crawler in Python with the BeautifulSoup library to scrape books from the https://books.toscrape.com/ website.
Which libraries will we need?
BeautifulSoup: BeautifulSoup is a Python library that makes it easy to scrape information from web pages. It sits on top of an HTML or XML parser, providing Pythonic ways of iterating, searching, and modifying the parse tree. To install this library, type the following command in the IDE/terminal.
pip install beautifulsoup4
requests: Requests is a simple, yet elegant, HTTP library. This library allows you to send HTTP/1.1 requests very easily. Install this library using the following command.
pip install requests
You also need to install the lxml parser, which will be used to parse the HTML.
pip install lxml
A stepwise implementation of the script
Follow these steps to build a multithreaded scraping script with BeautifulSoup and Python.
- Import all required libraries
- Create the main program and object of the class MultiThreadedScraper
- MultiThreadedScraper class implementation
- Scrape all URLs in the given source URL
- Create scrape_pages function which will scrape the required data
- Call ThreadPoolExecutor to run the scraper
Step 1: Import all required libraries
We first need to import all the libraries required to build the script. If you're using Python 3, the standard-library modules (re, concurrent.futures, urllib.parse) are already available; only BeautifulSoup, requests, and lxml must be installed separately. If you haven't installed these three libraries yet, install them using the commands given above.
import re
import requests
from bs4 import BeautifulSoup
import concurrent.futures
from urllib.parse import urljoin, urlparse
Step 2: Create the main program and object of the class MultiThreadedScraper
Next, create the main block with if __name__ == '__main__':. This block tells Python where execution starts. Inside it, create an object of the class MultiThreadedScraper and pass the URL of the website you want to scrape.
After creating the class object, call the function scrape_urls(), which scrapes all the book URLs on the given page. Basically, this function is just a URL scraper, which we will need when we implement the actual scraping function.
After scraping all URLs, call the function start_scraper() to scrape the title, price, and stock quantity of each book in a multithreaded way.
if __name__ == '__main__':
    cc = MultiThreadedScraper("https://books.toscrape.com/")
    cc.scrape_urls()
    cc.start_scraper()
Step 3: MultiThreadedScraper class implementation
Let’s implement the MultiThreadedScraper class with the required data members.
class MultiThreadedScraper:

    def __init__(self, source_url):
        self.source_url = source_url
        self.root_url = (
            f'{urlparse(self.source_url).scheme}:'
            f'//{urlparse(self.source_url).netloc}'
        )
        self.all_urls = set()
        self.total_threads = 10
In the above class implementation, we have the following data members:
- source_url: the URL passed from the main block. It is the page that contains the list of books to be extracted.
- root_url: the root URL (scheme and host) extracted from source_url.
- all_urls: the set of book URLs scraped from source_url. We use a set because it automatically removes duplicate URLs.
- total_threads: the number of threads that will run concurrently. We use 10 threads, but you can increase this according to your needs.
Step 4: Scrape all URLs in the given source URL
def scrape_urls(self):
    """Scrape all book urls present in given home page url"""
    page_data = requests.get(self.source_url).content
    soup = BeautifulSoup(page_data, 'lxml')
    all_urls = soup.select('article.product_pod h3 a')
    # Extract only book urls
    for link in all_urls:
        url = urljoin(self.root_url, link['href'])
        self.all_urls.add(url)
In the above code, we first fetch source_url using the requests library, get its HTML content, and save it in the page_data variable. We then create a BeautifulSoup object from page_data.
After that, we select all the book link elements using the select method of BeautifulSoup and store them in all_urls. Because this variable holds HTML elements rather than plain URLs, we loop through them, extract each href value, resolve it against root_url, and add the result to the all_urls set.
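The book links on the listing page are relative (for example, catalogue/sharp-objects_997/index.html, as seen in the output later), which is why urljoin is needed. A minimal sketch of how the root URL is derived and a relative href is resolved:

```python
from urllib.parse import urljoin, urlparse

source_url = "https://books.toscrape.com/"
# Rebuild the root URL from scheme and host, as the class constructor does.
root_url = f"{urlparse(source_url).scheme}://{urlparse(source_url).netloc}"

# A relative href scraped from a listing-page link element.
relative_href = "catalogue/sharp-objects_997/index.html"
print(urljoin(root_url, relative_href))
# https://books.toscrape.com/catalogue/sharp-objects_997/index.html
```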
Step 5: Create the scrape_pages function which will scrape the required data
def scrape_pages(self, page_url):
    page_data = requests.get(page_url).content
    soup = BeautifulSoup(page_data, 'lxml')
    # Scrape book title
    title = soup.select('h1')[0].text.strip()
    # Scrape price
    price = soup.select('p.price_color')[0].text.strip()
    # Scrape total stock
    quantity = soup.select('p.instock')[0].text.strip()
    match = re.search(r'\b(\d+)\b', quantity)
    if match:
        quantity = int(match.group(1))  # Extract number as integer
    else:
        quantity = 0
    print(
        f'URL: {page_url}, Title: {title}, '
        f'Price: {price}, Quantity: {quantity}'
    )
In the above code, we pass page_url, which points to a single book page containing all the required data. We fetch the given URL using the requests library and use BeautifulSoup to scrape the title, price, and stock text; a regular expression then extracts the numeric quantity from the stock text.
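The stock text on a book page reads something like "In stock (19 available)", so the regular expression simply captures the first run of digits. A small sketch of that extraction in isolation:

```python
import re

# Example stock text as it appears on a book page (illustrative input).
quantity_text = "In stock (19 available)"

# \b(\d+)\b captures the first whole number in the string.
match = re.search(r'\b(\d+)\b', quantity_text)
quantity = int(match.group(1)) if match else 0
print(quantity)  # 19
```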
Step 6: Call ThreadPoolExecutor to run the scraper
def start_scraper(self):
    """Begin concurrent scraper"""
    with concurrent.futures.ThreadPoolExecutor(self.total_threads) as executor:
        executor.map(self.scrape_pages, self.all_urls)
Finally, create the function start_scraper, which creates a ThreadPoolExecutor and maps scrape_pages over all the URLs we scraped earlier, so up to total_threads pages are fetched at the same time.
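To see the mechanism in isolation, here is a minimal, self-contained sketch (not part of the scraper; the work function is a stand-in for scrape_pages) showing how executor.map fans work out across a pool of threads:

```python
import concurrent.futures

# Stand-in for scrape_pages: any function taking one item from the iterable.
def work(n):
    return n * n

# Each input is handed to a worker thread; map preserves input order.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(work, [1, 2, 3, 4]))

print(results)  # [1, 4, 9, 16]
```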
The complete code is given below.
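Assembled from the snippets above, the complete script reads as follows. Note that running it makes live HTTP requests to https://books.toscrape.com/.

```python
import re
import requests
from bs4 import BeautifulSoup
import concurrent.futures
from urllib.parse import urljoin, urlparse


class MultiThreadedScraper:

    def __init__(self, source_url):
        self.source_url = source_url
        self.root_url = (
            f'{urlparse(self.source_url).scheme}:'
            f'//{urlparse(self.source_url).netloc}'
        )
        self.all_urls = set()
        self.total_threads = 10

    def scrape_urls(self):
        """Scrape all book urls present in given home page url"""
        page_data = requests.get(self.source_url).content
        soup = BeautifulSoup(page_data, 'lxml')
        for link in soup.select('article.product_pod h3 a'):
            self.all_urls.add(urljoin(self.root_url, link['href']))

    def scrape_pages(self, page_url):
        """Scrape title, price and stock quantity from one book page"""
        page_data = requests.get(page_url).content
        soup = BeautifulSoup(page_data, 'lxml')
        title = soup.select('h1')[0].text.strip()
        price = soup.select('p.price_color')[0].text.strip()
        quantity = soup.select('p.instock')[0].text.strip()
        match = re.search(r'\b(\d+)\b', quantity)
        quantity = int(match.group(1)) if match else 0
        print(
            f'URL: {page_url}, Title: {title}, '
            f'Price: {price}, Quantity: {quantity}'
        )

    def start_scraper(self):
        """Begin concurrent scraper"""
        with concurrent.futures.ThreadPoolExecutor(self.total_threads) as executor:
            executor.map(self.scrape_pages, self.all_urls)


if __name__ == '__main__':
    cc = MultiThreadedScraper("https://books.toscrape.com/")
    cc.scrape_urls()
    cc.start_scraper()
```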
When you run the above code, the output will look like the following:
URL: https://books.toscrape.com/catalogue/set-me-free_988/index.html, Title: Set Me Free, Price: £17.46, Quantity: 19
URL: https://books.toscrape.com/catalogue/sharp-objects_997/index.html, Title: Sharp Objects, Price: £47.82, Quantity: 20
URL: https://books.toscrape.com/catalogue/the-requiem-red_995/index.html, Title: The Requiem Red, Price: £22.65, Quantity: 19
URL: https://books.toscrape.com/catalogue/olio_984/index.html, Title: Olio, Price: £23.88, Quantity: 19
Related: You should also check the How to Scrape Email and Phone Number from Any Website with Python tutorial.