Introducing a Web Scraping Framework That Can Replace Scrapy - feapder

1. Introduction

Scrapy is the best-known web scraping framework in Python, used mainly for extracting structured data from websites.

Today, I recommend a simpler, more lightweight, yet still powerful web scraping framework: feapder.

Project Address: https://github.com/Boris-code/feapder

2. Overview and Installation

Similar to Scrapy, feapder supports lightweight spiders, distributed spiders, batch spiders, spider alert mechanisms, and other functions.

The 3 built-in spider types are:

AirSpider: Lightweight spider, suitable for simple scenarios and small data volumes
Spider: Distributed spider based on Redis, suitable for large data volumes; supports resuming an interrupted crawl, automatic data storage, and other functions
BatchSpider: Distributed batch spider, mainly used for data that is collected periodically

Before the hands-on example, install feapder in a virtual environment:

bash

# Install dependency library
pip3 install feapder

3. Practical Example

Let’s use the simplest AirSpider to scrape some simple data.

Target website (Base64-encoded): aHR0cHM6Ly90b3BodWIudG9kYXkvIA==
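A minimal sketch for decoding the address above with Python's standard library. It assumes the string is standard Base64; the variable names are illustrative only.

```python
import base64

# Encoded target address, copied verbatim from the article
encoded_target = "aHR0cHM6Ly90b3BodWIudG9kYXkvIA=="

# Decode to text; the encoded value carries a trailing space, so strip it
target_url = base64.b64decode(encoded_target).decode("utf-8").strip()

print(target_url)
```

The decoded value is the same URL that is passed to feapder.Request in step 3.4.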

Detailed implementation steps (5 steps):

3.1 Create Spider Project

First, we use the feapder create -p command to create a spider project:

bash

# Create a spider project
feapder create -p tophub_demo

3.2 Create AirSpider

Navigate to the spiders folder in the command line and use the feapder create -s command to create a spider:

bash

cd spiders

# Create a lightweight spider
feapder create -s tophub_spider 1

Where:

1 is the default, representing creating a lightweight spider AirSpider
2 represents creating a distributed spider Spider
3 represents creating a distributed batch spider BatchSpider

3.3 Configure Database, Create Data Table, Create Mapping Item

Using MySQL as an example, first we create a data table in the database:

sql

# Create a data table
create table topic
(
    id         int auto_increment
        primary key,
    title      varchar(100)  null comment 'Article title',
    auth       varchar(20)   null comment 'Author',
    like_count     int default 0 null comment 'Like count',
    collection int default 0 null comment 'Collection count',
    comment    int default 0 null comment 'Comment count'
);

Then, open the settings.py file in the project root directory and configure the database connection information:

python

# settings.py

MYSQL_IP = "localhost"
MYSQL_PORT = 3306
MYSQL_DB = "xag"
MYSQL_USER_NAME = "root"
MYSQL_USER_PASS = "root"
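If you would rather not hard-code credentials, one option is to read them from environment variables and fall back to the values above. The following is only a sketch: the helper name load_mysql_settings is hypothetical, and reusing the setting names as environment variable names is an assumption.

```python
import os


def load_mysql_settings(env=os.environ):
    """Build the MySQL settings from a mapping, using the article's values as defaults."""
    return {
        "MYSQL_IP": env.get("MYSQL_IP", "localhost"),
        "MYSQL_PORT": int(env.get("MYSQL_PORT", 3306)),
        "MYSQL_DB": env.get("MYSQL_DB", "xag"),
        "MYSQL_USER_NAME": env.get("MYSQL_USER_NAME", "root"),
        "MYSQL_USER_PASS": env.get("MYSQL_USER_PASS", "root"),
    }


# A plain dict stands in for the real environment in this demo
settings = load_mysql_settings({"MYSQL_PORT": "3307"})
print(settings["MYSQL_IP"], settings["MYSQL_PORT"])
```

In settings.py, the returned values could then be assigned to the module-level names shown above.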

Finally, create a mapping Item (optional):

Navigate to the items folder and use the feapder create -i command to generate an Item file that maps to the data table.

PS: Since AirSpider doesn’t support automatic data storage, this step is not mandatory.

3.4 Write Spider and Data Parsing

Step 1: First initialize the database using MysqlDB:

python

import feapder
from feapder.db.mysqldb import MysqlDB

class TophubSpider(feapder.AirSpider):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.db = MysqlDB()

Step 2: In the start_requests method, specify the entry URL to scrape, and pass the download_midware keyword argument to set a random User-Agent (UA) on each request:

python

import feapder
from fake_useragent import UserAgent

def start_requests(self):
    yield feapder.Request("https://tophub.today/", download_midware=self.download_midware)

def download_midware(self, request):
    # Random UA
    # Dependency: pip3 install fake_useragent
    ua = UserAgent().random
    request.headers = {'User-Agent': ua}
    return request
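Note that assigning a new dict to request.headers replaces any headers already set on the request. The sketch below shows one way to merge the User-Agent in instead. It uses only the standard library: USER_AGENTS is a hypothetical fixed pool standing in for fake_useragent, and a SimpleNamespace stands in for feapder.Request, so treat it as an illustration of the merge logic rather than feapder's API.

```python
import random
from types import SimpleNamespace

# Hypothetical fixed pool; the article draws from fake_useragent instead
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def apply_random_ua(request, pool=USER_AGENTS):
    """Set a random User-Agent while keeping any headers already on the request."""
    headers = dict(getattr(request, "headers", None) or {})
    headers["User-Agent"] = random.choice(pool)
    request.headers = headers
    return request


# Stand-in object with a headers attribute, used here instead of feapder.Request
demo_request = SimpleNamespace(headers={"Referer": "https://tophub.today/"})
apply_random_ua(demo_request)
print(demo_request.headers)
```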

Step 3: Scrape homepage titles and URLs

Use feapder’s built-in xpath method to parse data:

python

def parse(self, request, response):
    # print(response.text)
    card_elements = response.xpath('//div[@class="cc-cd"]')

    # Filter out corresponding card elements ["What's Worth Buying"]
    buy_good_element = [card_element for card_element in card_elements if
                        card_element.xpath('.//div[@class="cc-cd-is"]//span/text()').extract_first() == 'What\'s Worth Buying'][0]

    # Get internal article titles and addresses
    a_elements = buy_good_element.xpath('.//div[@class="cc-cd-cb nano"]//a')

    for a_element in a_elements:
        # Title and link
        title = a_element.xpath('.//span[@class="t"]/text()').extract_first()
        href = a_element.xpath('.//@href').extract_first()

        # Issue new task again, carrying article title
        yield feapder.Request(href, download_midware=self.download_midware, callback=self.parser_detail_page,
                              title=title)
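The list comprehension followed by [0] raises IndexError if no card matches, for example after the site changes its layout. A small, generic sketch of a safer "first match or default" lookup follows. The helper first_match and the sample card names are hypothetical, and plain tuples stand in for the XPath selector objects.

```python
def first_match(items, predicate, default=None):
    """Return the first item for which predicate is true, or default if none matches."""
    return next((item for item in items if predicate(item)), default)


# Stand-in data: (card title, card payload) pairs instead of XPath selectors
cards = [("Some Other Card", "card-1"), ("What's Worth Buying", "card-2")]

card = first_match(cards, lambda c: c[0] == "What's Worth Buying")
missing = first_match(cards, lambda c: c[0] == "No Such Card")

print(card, missing)
```

In parse, the same idea would let the method return early when the card is not found instead of crashing.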

Step 4: Scrape detail page data

The previous step issues a new task, specifies the callback function via the callback keyword, and finally parses the detail page data in parser_detail_page:

python

import re

def parser_detail_page(self, request, response):
    """
    Parse article detail data
    :param request:
    :param response:
    :return:
    """
    title = request.title

    url = request.url

    # Parse article detail page, get like, collection, comment counts and author name
    author = response.xpath('//a[@class="author-title"]/text()').extract_first().strip()

    print("Author:", author, 'Article title:', title, "URL:", url)

    desc_elements = response.xpath('//span[@class="xilie"]/span')

    print("Description count:", len(desc_elements))

    # Likes (raw strings keep \d from being treated as an invalid escape sequence)
    like_count = int(re.findall(r'\d+', desc_elements[1].xpath('./text()').extract_first())[0])
    # Collections
    collection_count = int(re.findall(r'\d+', desc_elements[2].xpath('./text()').extract_first())[0])
    # Comments
    comment_count = int(re.findall(r'\d+', desc_elements[3].xpath('./text()').extract_first())[0])

    print("Likes:", like_count, "Collections:", collection_count, "Comments:", comment_count)
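The three count extractions above share the same pattern and fail if the text is missing or contains no digits. A minimal sketch of a helper that captures the pattern once and falls back to a default is shown here. The name first_int is hypothetical, and the sample strings are made up for the demo.

```python
import re


def first_int(text, default=0):
    """Return the first run of digits in text as an int, or default if there is none."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else default


print(first_int("Likes 12"), first_int("no digits"), first_int(None))
```

With such a helper, each count could be written as first_int(desc_elements[i].xpath('./text()').extract_first()).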

3.5 Data Storage

Use the previously instantiated database object to execute SQL and insert data into the database:

python

# Insert into database (values are interpolated as-is; a quote in the title or author would break the statement)
sql = "INSERT INTO topic(title,auth,like_count,collection,comment) values('%s','%s',%d,%d,%d)" % (
    title, author, like_count, collection_count, comment_count)

# Execute
self.db.execute(sql)
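Building SQL with string formatting breaks as soon as a title contains a quote, and it is open to SQL injection. The sketch below illustrates the usual alternative, placeholders with separately passed values. It uses the standard library's sqlite3 with an in-memory database as a stand-in for MySQL, so the placeholder is ? here, whereas MySQL drivers such as PyMySQL use %s. Whether feapder's MysqlDB exposes an equivalent is not covered by this article and should be checked against its documentation. The function insert_topic is hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for MySQL so the sketch runs anywhere
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE topic ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, auth TEXT, "
    "like_count INTEGER DEFAULT 0, collection INTEGER DEFAULT 0, "
    "comment INTEGER DEFAULT 0)"
)


def insert_topic(conn, title, author, like_count, collection_count, comment_count):
    """Insert one row using placeholders, so quotes in the values need no manual escaping."""
    conn.execute(
        "INSERT INTO topic(title, auth, like_count, collection, comment) "
        "VALUES (?, ?, ?, ?, ?)",
        (title, author, like_count, collection_count, comment_count),
    )
    conn.commit()


insert_topic(conn, "It's a title with a quote", "someone", 12, 3, 5)
row = conn.execute(
    "SELECT title, auth, like_count, collection, comment FROM topic"
).fetchone()
print(row)
```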

4. Conclusion

This article discussed the simplest spider in feapder, AirSpider, through a basic example. Regarding the use of feapder’s advanced features, I will provide detailed explanations through a series of examples in the future.

