A Recommended Lightweight Python Crawler Library – RoboBrowser Usage Tutorial
1. Introduction
Today, I’d like to recommend a niche, lightweight crawler library: RoboBrowser – “Your friendly neighborhood web scraper!” Written entirely in Python, it runs without requiring a separate browser, and it can handle not only web scraping but also web automation.
Project Address: https://github.com/jmcarp/robobrowser
2. Installation and Usage
Before putting it into practice, let’s first install the library and its parser.
PS: The officially recommended parser is “lxml”
```bash
# Install the library
pip3 install robobrowser

# lxml parser (officially recommended)
pip3 install lxml
```
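Note: RoboBrowser hasn’t been updated in years, so on newer environments `from robobrowser import RoboBrowser` may fail with an ImportError about `cached_property` in Werkzeug. If you hit that, a commonly shared workaround is to alias the attribute before the import (pinning an older Werkzeug, e.g. `pip3 install "werkzeug<1.0"`, also works). A sketch:

```python
# Workaround sketch for Werkzeug >= 1.0, where cached_property moved
# to werkzeug.utils; apply it before importing robobrowser
from werkzeug.utils import cached_property
import werkzeug

werkzeug.cached_property = cached_property

from robobrowser import RoboBrowser
```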
The two most common uses of RoboBrowser are:
- Simulating form submissions
- Scraping web data
For web data scraping using RoboBrowser, the three common methods are:
- find: Query the first element matching the condition on the current page
- find_all: Query a list of elements sharing common attributes on the current page
- select: Query the page with a CSS selector, returning a list of elements
It’s important to note that RoboBrowser is built on top of BS4 (BeautifulSoup), so these methods work just like their BS4 counterparts.
For more features, refer to: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
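To make the three methods concrete, here is a minimal sketch; the tag names, the CSS selector, and the use of example.com are illustrative only:

```python
from robobrowser import RoboBrowser

rb = RoboBrowser(parser='lxml')
rb.open('https://example.com')

# find: the first element matching the condition (a BS4 Tag), or None
first_heading = rb.find('h1')

# find_all: every matching element, as a list
all_links = rb.find_all('a')

# select: CSS-selector query, also returning a list
paragraph_links = rb.select('p a')

if first_heading is not None:
    print(first_heading.text)
print(len(all_links), len(paragraph_links))
```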
3. Practical Example
Let’s take “searching on Baidu and scraping the search result list” as an example.
3.1 Open Target Website
First, we instantiate a RoboBrowser object:
```python
from time import sleep

from robobrowser import RoboBrowser

home_url = 'https://baidu.com'

# parser: the HTML parser used by BeautifulSoup underneath
# Officially recommended: lxml
rb = RoboBrowser(history=True, parser='lxml')

# Open the target website
rb.open(home_url)
```
Then, use the open() method of the RoboBrowser instance to open the target website.
3.2 Automated Form Submission
First, use the RoboBrowser instance to get the form on the page. Then, simulate keyboard input by assigning values to the form’s input fields. Finally, use the submit_form() method to submit the form, simulating a search.
```python
# Get the form object
bd_form = rb.get_form()
print(bd_form)

# Fill in the search keyword
bd_form['wd'].value = "AirPython"

# Submit the form, simulating a search
rb.submit_form(bd_form)
```
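When a page contains more than one form, get_form() with no arguments simply returns the first one. Here is a small sketch of the alternatives; note that looking the form up via id='form' assumes Baidu’s search form keeps that id, which is worth verifying:

```python
# List every form on the current page
forms = rb.get_forms()
print(len(forms))

# Or look a form up by its id attribute
# (id='form' is an assumption about Baidu's current markup)
bd_form = rb.get_form(id='form')
```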
3.3 Data Scraping
Analyze the structure of the search results page, then use RoboBrowser’s select() method to match all of the result-list elements. Iterate over them, using the find() method to pull the title and the href link address from each item.
```python
# Match all search result elements
result_elements = rb.select(".result")

# Collected result titles
search_result = []

# Anchor element of the first result (used by follow_link() below)
first_link = None

for index, element in enumerate(result_elements):
    title = element.find("a").text
    href = element.find("a")['href']
    search_result.append(title)
    if index == 0:
        first_link = element.find("a")
        print('First item address:', href)

print(search_result)
```
Finally, use the follow_link() method in RoboBrowser to simulate “clicking a link to view webpage details”:
```python
# Jump to the first link
rb.follow_link(first_link)

# Print the URL of the page we landed on
print(rb.url)
```
Note: the argument to the follow_link() method is an a tag carrying an href value, not a bare URL string.
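Because the browser was created with history=True, RoboBrowser also records the pages you visit, and you can walk that history with back() and forward(), much like a real browser’s navigation buttons. A minimal sketch:

```python
# Return to the search results page we came from
rb.back()
print(rb.url)

# ...and forward again to the detail page
rb.forward()
print(rb.url)
```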
4. Conclusion
Using the Baidu search example, this article demonstrated how to use RoboBrowser to complete an automation-plus-scraping task. Compared with Selenium, Helium, and the like, RoboBrowser is more lightweight and doesn’t depend on a separate browser or driver.
If you only need to handle simple scraping or web automation tasks, RoboBrowser is fully up to the job. For complex automation scenarios, however, Selenium, Pyppeteer, Helium, and similar tools are the better choice.