Introducing a Web Scraping Framework That Can Replace Scrapy – feapder
1. Introduction
Scrapy is Python's most popular web scraping framework, used mainly for scraping structured data from websites.
Today, I recommend a simpler, more lightweight, yet still powerful web scraping framework: feapder.
Project Address: https://github.com/Boris-code/feapder
2. Introduction and Installation
Similar to Scrapy, feapder supports lightweight spiders, distributed spiders, batch spiders, spider alert mechanisms, and other functions.
The 3 built-in spider types are:
- AirSpider: Lightweight spider, suitable for simple scenarios and small data volume scraping
- Spider: Distributed spider, based on Redis, suitable for massive data, and supports breakpoint resumption, automatic data storage, and other functions
- BatchSpider: Distributed batch spider, mainly used for periodically collected data
Before using it, install feapder in a virtual environment:
bash
# Install the dependency
pip3 install feapder
3. Practical Example
Let’s use the simplest AirSpider to scrape some simple data.
Target website (Base64-encoded): aHR0cHM6Ly90b3BodWIudG9kYXkvIA==
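The target address above appears to be Base64-encoded. A minimal sketch of decoding it with Python's standard library follows; the variable names are illustrative only.

```python
import base64

# The encoded string is copied verbatim from the article.
encoded_target = "aHR0cHM6Ly90b3BodWIudG9kYXkvIA=="

# b64decode returns bytes; decode to text and strip the trailing space
# that the encoded value happens to contain.
target_url = base64.b64decode(encoded_target).decode("utf-8").strip()
print(target_url)
```

The decoded value matches the URL passed to feapder.Request in step 3.4.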
Detailed implementation steps (5 steps):
3.1 Create Spider Project
First, we use the feapder create -p command to create a spider project:
bash
# Create a spider project
feapder create -p tophub_demo
3.2 Create AirSpider
Navigate to the spiders folder in the command line and use the feapder create -s command to create a spider:
bash
cd spiders
# Create a lightweight spider
feapder create -s tophub_spider 1
Where:
- 1 is the default and creates a lightweight spider (AirSpider)
- 2 creates a distributed spider (Spider)
- 3 creates a distributed batch spider (BatchSpider)
3.3 Configure Database, Create Data Table, Create Mapping Item
Using MySQL as an example, first we create a data table in the database:
sql
-- Create a data table
create table topic
(
    id          int auto_increment
        primary key,
    title       varchar(100)  null comment 'Article title',
    auth        varchar(20)   null comment 'Author',
    like_count  int default 0 null comment 'Like count',
    collection  int default 0 null comment 'Collection count',
    comment     int default 0 null comment 'Comment count'
);
Then, open the settings.py file in the project root directory and configure the database connection information:
python
# settings.py
MYSQL_IP = "localhost"
MYSQL_PORT = 3306
MYSQL_DB = "xag"
MYSQL_USER_NAME = "root"
MYSQL_USER_PASS = "root"
Finally, create a mapping Item (optional):
Navigate to the items folder and use the feapder create -i command to create a file mapping to the database.
PS: Since AirSpider doesn’t support automatic data storage, this step is not mandatory.
3.4 Write Spider and Data Parsing
Step 1: First initialize the database using MysqlDB:
python
import feapder
from feapder.db.mysqldb import MysqlDB


class TophubSpider(feapder.AirSpider):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # MysqlDB() picks up the connection details configured in settings.py
        self.db = MysqlDB()
Step 2: In the start_requests method, specify the entry URL to scrape, and pass the download_midware keyword argument to set a random User-Agent on each request:
python
import feapder
from fake_useragent import UserAgent

# Both functions below are methods of TophubSpider
def start_requests(self):
    yield feapder.Request("https://tophub.today/", download_midware=self.download_midware)

def download_midware(self, request):
    # Random UA
    # Dependency: pip3 install fake_useragent
    ua = UserAgent().random
    request.headers = {'User-Agent': ua}
    return request
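If you would rather not depend on fake_useragent, the same hook can pick from a fixed list. The sketch below is standalone: FakeRequest is a hypothetical stand-in for feapder.Request so that the function runs without feapder installed, and the two User-Agent strings are sample values, not a recommended set.

```python
import random

# Sample User-Agent strings for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


class FakeRequest:
    """Hypothetical stand-in for feapder.Request: a URL and a headers dict."""

    def __init__(self, url):
        self.url = url
        self.headers = {}


def download_midware(request):
    """Set a randomly chosen User-Agent, keeping any headers already present."""
    headers = dict(request.headers or {})
    headers["User-Agent"] = random.choice(USER_AGENTS)
    request.headers = headers
    return request


request = download_midware(FakeRequest("https://tophub.today/"))
print(request.headers["User-Agent"])
```

Inside the spider, the function would be a method taking (self, request), as in the article's version. Merging into the existing headers, rather than replacing the whole dict, preserves anything set elsewhere.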
Step 3: Scrape homepage titles and URLs
Use feapder’s built-in xpath method to parse data:
python
def parse(self, request, response):
    # print(response.text)
    card_elements = response.xpath('//div[@class="cc-cd"]')
    # Select the card whose heading is "What's Worth Buying"
    buy_good_element = [card_element for card_element in card_elements if
                        card_element.xpath('.//div[@class="cc-cd-is"]//span/text()').extract_first() == "What's Worth Buying"][0]
    # Get the article titles and links inside that card
    a_elements = buy_good_element.xpath('.//div[@class="cc-cd-cb nano"]//a')
    for a_element in a_elements:
        # Title and link
        title = a_element.xpath('.//span[@class="t"]/text()').extract_first()
        href = a_element.xpath('.//@href').extract_first()
        # Issue a new request for the detail page, carrying the article title along
        yield feapder.Request(href, download_midware=self.download_midware, callback=self.parser_detail_page,
                              title=title)
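Note that a list comprehension followed by [0] raises IndexError when no card matches. Below is a sketch of the same select-then-extract logic written defensively. It uses xml.etree.ElementTree from the standard library on a made-up, well-formed snippet rather than feapder's response object, so the markup, the example.com links, and the function name are all assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up markup that mimics the class names used in the spider above.
SAMPLE_PAGE = """
<html><body>
  <div class="cc-cd">
    <div class="cc-cd-is"><span>Other Board</span></div>
    <div class="cc-cd-cb nano">
      <a href="https://example.com/x"><span class="t">Article X</span></a>
    </div>
  </div>
  <div class="cc-cd">
    <div class="cc-cd-is"><span>What's Worth Buying</span></div>
    <div class="cc-cd-cb nano">
      <a href="https://example.com/a"><span class="t">Article A</span></a>
      <a href="https://example.com/b"><span class="t">Article B</span></a>
    </div>
  </div>
</body></html>
"""


def extract_links(page, board_name):
    """Return (title, href) pairs from the card named board_name, or [] if absent."""
    root = ET.fromstring(page.strip())
    for card in root.findall(".//div[@class='cc-cd']"):
        heading = card.findtext("./div[@class='cc-cd-is']/span", default="")
        if heading.strip() != board_name:
            continue
        links = []
        for a in card.findall("./div[@class='cc-cd-cb nano']/a"):
            title = a.findtext("./span[@class='t']", default="").strip()
            links.append((title, a.get("href")))
        return links
    return []  # no matching card: return an empty list instead of raising


print(extract_links(SAMPLE_PAGE, "What's Worth Buying"))
print(extract_links(SAMPLE_PAGE, "Missing Board"))
```

In the spider itself, the equivalent guard would be to check that the filtered list is non-empty before indexing it.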
Step 4: Scrape detail page data
The previous step issues a new task, specifies the callback function via the callback keyword, and finally parses the detail page data in parser_detail_page:
python
import re

def parser_detail_page(self, request, response):
    """
    Parse article detail data
    :param request:
    :param response:
    :return:
    """
    title = request.title
    url = request.url
    # Parse the detail page: author name plus like, collection, and comment counts
    author = response.xpath('//a[@class="author-title"]/text()').extract_first().strip()
    print("Author:", author, "Article title:", title, "URL:", url)
    desc_elements = response.xpath('//span[@class="xilie"]/span')
    print("Description count:", len(desc_elements))
    # Likes
    like_count = int(re.findall(r'\d+', desc_elements[1].xpath('./text()').extract_first())[0])
    # Collections
    collection_count = int(re.findall(r'\d+', desc_elements[2].xpath('./text()').extract_first())[0])
    # Comments
    comment_count = int(re.findall(r'\d+', desc_elements[3].xpath('./text()').extract_first())[0])
    print("Likes:", like_count, "Collections:", collection_count, "Comments:", comment_count)
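Each count above is pulled out of a text node with re.findall(...)[0], which fails if the node is missing or contains no digits. A small helper can make that case explicit. The name parse_count and the sample strings are hypothetical, since the real page text is not shown in this article.

```python
import re


def parse_count(text, default=0):
    """Return the first run of digits in text as an int, or default if none."""
    if not text:
        return default
    match = re.search(r"\d+", text)
    return int(match.group()) if match else default


print(parse_count("Likes 128"))   # digits embedded in a label
print(parse_count("no digits"))   # falls back to the default
print(parse_count(None))          # extract_first() can return None
```

With such a helper, each of the three count lines reduces to a single call on the extracted text.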
3.5 Data Storage
Use the previously instantiated database object to execute SQL and insert data into the database:
python
# Insert into database (this code goes at the end of parser_detail_page)
sql = "INSERT INTO topic(title,auth,like_count,collection,comment) values('%s','%s',%d,%d,%d)" % (
    title, author, like_count, collection_count, comment_count)
# Execute
self.db.execute(sql)
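Building the statement with % formatting breaks as soon as a title contains a single quote, and it leaves the insert open to SQL injection. The usual remedy is to pass values as query parameters. The sketch below demonstrates the idea with the standard library's sqlite3 as a stand-in for MySQL. It does not use feapder's MysqlDB, the table is a simplified copy of topic, and the sample row is invented.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE topic (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT,
        auth TEXT,
        like_count INTEGER DEFAULT 0,
        collection INTEGER DEFAULT 0,
        comment INTEGER DEFAULT 0
    )
    """
)

# Invented sample row; note the single quote in the title.
row = ("What's new this week", "some_author", 12, 3, 5)

# Placeholders let the driver handle quoting, so the apostrophe is safe.
conn.execute(
    "INSERT INTO topic (title, auth, like_count, collection, comment) "
    "VALUES (?, ?, ?, ?, ?)",
    row,
)
conn.commit()

saved = conn.execute(
    "SELECT title, auth, like_count, collection, comment FROM topic"
).fetchone()
print(saved)
```

MySQL drivers such as PyMySQL follow the same pattern but use %s as the placeholder instead of ?.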
4. Conclusion
This article walked through feapder's simplest spider, AirSpider, with a basic example. I will cover feapder's advanced features in detail through a series of examples in future articles.