A Recommended Lightweight Python Crawler Library – RoboBrowser Usage Tutorial
1. Introduction
Today, I’d like to recommend a niche, lightweight crawler library: RoboBrowser – “Your friendly neighborhood web scraper!” Written entirely in Python, it runs without requiring a separate browser, and it can handle not only web scraping but also web automation.
Project Address: https://github.com/jmcarp/robobrowser
2. Installation and Usage
Before putting it into practice, let’s first install the library and its parser.
PS: The officially recommended parser is “lxml”
```bash
# Install the library
pip3 install robobrowser

# lxml parser (officially recommended)
pip3 install lxml
```
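Note: RoboBrowser hasn’t been updated in years, so on newer environments `from robobrowser import RoboBrowser` may fail with an ImportError about `cached_property` in Werkzeug. If you hit that, a commonly shared workaround is to alias the attribute before the import (pinning an older Werkzeug, e.g. `pip3 install "werkzeug<1.0"`, also works). A sketch:

```python
# Workaround sketch for Werkzeug >= 1.0, where cached_property moved
# to werkzeug.utils; apply it before importing robobrowser
from werkzeug.utils import cached_property
import werkzeug

werkzeug.cached_property = cached_property

from robobrowser import RoboBrowser
```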
The two most common uses of RoboBrowser are:
- Simulating form submissions
- Scraping web data
For web data scraping using RoboBrowser, the three common methods are:
- find: Query the first element matching the condition on the current page
- find_all: Query a list of elements sharing common attributes on the current page
- select: Query the page with a CSS selector, returning a list of elements
It’s important to note that RoboBrowser is built on top of BS4 (BeautifulSoup), so these methods work just like their BS4 counterparts.
For more features, refer to: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
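To make the three methods concrete, here is a minimal sketch; the tag names, the CSS selector, and the use of example.com are illustrative only:

```python
from robobrowser import RoboBrowser

rb = RoboBrowser(parser='lxml')
rb.open('https://example.com')

# find: the first element matching the condition (a BS4 Tag), or None
first_heading = rb.find('h1')

# find_all: every matching element, as a list
all_links = rb.find_all('a')

# select: CSS-selector query, also returning a list
paragraph_links = rb.select('p a')

if first_heading is not None:
    print(first_heading.text)
print(len(all_links), len(paragraph_links))
```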
3. Practical Example
Let’s take “searching on Baidu and scraping the search result list” as an example.
3.1 Open Target Website
First, we instantiate a RoboBrowser object:
```python
from time import sleep

from robobrowser import RoboBrowser

home_url = 'https://baidu.com'

# parser: the HTML parser used by BeautifulSoup underneath
# Officially recommended: lxml
rb = RoboBrowser(history=True, parser='lxml')

# Open the target website
rb.open(home_url)
```
Then, use the open() method of the RoboBrowser instance to open the target website.
3.2 Automated Form Submission
First, use the RoboBrowser instance to get the form on the page. Then, simulate keyboard input by assigning values to the form’s input fields. Finally, use the submit_form() method to submit the form, simulating a search.
```python
# Get the form object
bd_form = rb.get_form()
print(bd_form)

# Fill in the search keyword
bd_form['wd'].value = "AirPython"

# Submit the form, simulating a search
rb.submit_form(bd_form)
```
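When a page contains more than one form, get_form() with no arguments simply returns the first one. Here is a small sketch of the alternatives; note that looking the form up via id='form' assumes Baidu’s search form keeps that id, which is worth verifying:

```python
# List every form on the current page
forms = rb.get_forms()
print(len(forms))

# Or look a form up by its id attribute
# (id='form' is an assumption about Baidu's current markup)
bd_form = rb.get_form(id='form')
```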
3.3 Data Scraping
Analyze the structure of the search results page, then use RoboBrowser’s select() method to match all of the result-list elements. Iterate over them, using the find() method to pull the title and the href link address from each item.
```python
# Match all search result elements
result_elements = rb.select(".result")

# Collected result titles
search_result = []

# Anchor element of the first result (used by follow_link() below)
first_link = None

for index, element in enumerate(result_elements):
    title = element.find("a").text
    href = element.find("a")['href']
    search_result.append(title)
    if index == 0:
        first_link = element.find("a")
        print('First item address:', href)

print(search_result)
```
Finally, use the follow_link() method in RoboBrowser to simulate “clicking a link to view webpage details”:
```python
# Jump to the first link
rb.follow_link(first_link)

# Print the URL of the page we landed on
print(rb.url)
```
Note: the argument to the follow_link() method is an a tag carrying an href value, not a bare URL string.
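Because the browser was created with history=True, RoboBrowser also records the pages you visit, and you can walk that history with back() and forward(), much like a real browser’s navigation buttons. A minimal sketch:

```python
# Return to the search results page we came from
rb.back()
print(rb.url)

# ...and forward again to the detail page
rb.forward()
print(rb.url)
```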
4. Conclusion
Using the Baidu search example, this article demonstrated how to use RoboBrowser to complete an automation-plus-scraping task. Compared with Selenium, Helium, and the like, RoboBrowser is more lightweight and doesn’t depend on a separate browser or driver.
If you only need to handle simple scraping or web automation tasks, RoboBrowser is fully up to the job. For complex automation scenarios, however, Selenium, Pyppeteer, Helium, and similar tools are the better choice.