Python Web Scraping in Action: Crawling Videos to Local Storage – A Super Detailed Practical Tutorial
I. Preparation (Environment and Basics)
Before starting to write the crawler code, we need to complete tool installation and environment configuration, as well as understand the basic structure of website videos to avoid pitfalls.
1.1 Required Tools and Libraries
This tutorial requires the following Python libraries. Their functions are as follows:
requests: Sends HTTP requests to obtain webpage data and video resources.
BeautifulSoup4: Parses HTML pages to extract information such as video titles and links.
you-get: The core tool for crawling website videos. It can automatically handle encrypted video links and format conversion.
lxml: An efficient parser for XML/HTML, used in conjunction with BeautifulSoup.
os: Handles local file paths and creates directories for saving videos.
time: Controls the frequency of requests to avoid being blocked by anti-crawling mechanisms.
Note: os and time are part of Python's standard library and need no separate installation.
1.2 Environment Installation
Open the command line (CMD or PowerShell on Windows, Terminal on Mac/Linux) and execute the following commands to install the required libraries:
bash
# Install basic libraries
pip install requests beautifulsoup4 lxml
# Install the core crawling tool you-get (Key!)
pip install you-get
Note: If errors occur during the installation of you-get, it might be due to network issues. You can try a PyPI mirror source (e.g., the Tsinghua mirror in China):
bash
pip install you-get -i https://pypi.tuna.tsinghua.edu.cn/simple
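After installation, you can quickly confirm that you-get works from the command line (both flags below are standard you-get options):
bash
# Print the installed you-get version
you-get --version
# List the streams available for a video without downloading anything
you-get -i https://www.bilibili.com/video/BV1GJ411x7h7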
1.3 Understanding Key Information About Website Videos
Website videos are generally divided into two forms: “single video” and “video collection (playlist)”. The crawling logic differs slightly for each:
Single Video: Has a direct, independent video page URL.
Video Collection (Playlist): Contains multiple sub-videos (episodes) under one main page. Sub-video URLs usually follow a pattern (e.g., …?p=1, …?p=2).
Furthermore, many sites obfuscate or encrypt their video resources to some extent, so parsing the HTML directly cannot yield the real video download links. The you-get library already encapsulates the extraction logic, so we only need to call its interface. A quick sketch of the playlist URL pattern follows below.
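To make the ?p=N pattern concrete, here is a minimal sketch that builds sub-video URLs for a collection (the BV id and the episode count of 3 are placeholders for illustration):
python
# Minimal sketch of the "?p=N" sub-video URL pattern (episode count assumed known)
base_url = "https://www.bilibili.com/video/BV1xx4y1x7xx"  # placeholder collection URL
episode_count = 3  # assumed for illustration
sub_video_urls = [f"{base_url}?p={n}" for n in range(1, episode_count + 1)]
print(sub_video_urls)  # ['...?p=1', '...?p=2', '...?p=3']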
II. Core Function Implementation (Step-by-Step Explanation)
We will implement batch crawling in three stages: obtaining the video list → single video crawling → batch crawling of collections. Each stage includes complete code and comments.
2.1 Single Video Crawling (Basic Version)
First, we implement the most basic single-video crawling function. The core is to invoke the you-get command line via subprocess, specifying the video URL and the save path.
Code Implementation:
python
import os
import subprocess

def download_single_video(video_url, save_path):
    """
    Single video download compatible with very old versions of you-get (core parameters only).
    :param video_url: Video URL
    :param save_path: Local save path (avoid spaces and special characters)
    """
    # 1. Check and create the save path (avoids failure due to a non-existent path)
    if not os.path.exists(save_path):
        os.makedirs(save_path)
        print(f"[DIR] Save directory created: {save_path}")
    # 2. Construct a command compatible with very old versions (keep only the -o option)
    # Command format: you-get -o save_path video_url
    cmd = [
        "you-get",        # Base command
        "-o", save_path,  # "Specify save path" -- the most widely supported option
        video_url         # Target video URL (must come last)
    ]
    # 3. Execute the command and capture the output (to check the download status)
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            encoding="utf-8"  # Avoid garbled non-ASCII output
        )
        # Check the download result
        if result.returncode == 0:
            print(f"[OK] Video downloaded successfully!\n URL: {video_url}\n Save Path: {save_path}\n")
            # Optional: print the download log (to inspect speed/progress)
            # print("Download Details:\n", result.stdout)
        else:
            # Keep only the first 200 characters of the error (avoids overly long output)
            error_msg = result.stderr[:200]
            print(f"[FAIL] Video download failed!\n URL: {video_url}\n Error Info: {error_msg}...\n")
    except Exception as e:
        print(f"[FAIL] Failed to call you-get!\n URL: {video_url}\n Exception Reason: {str(e)}\n")

# ------------------- Test Single Video Download -------------------
if __name__ == "__main__":
    # Test video URL (choose a public video without copyright restrictions)
    test_url = "https://www.bilibili.com/video/BV1GJ411x7h7"
    # Save path (use a simple path without spaces/special symbols)
    save_dir = "D:/WebsiteVideos/SingleVideoTest"
    # Call the function (works even with very old you-get versions)
    download_single_video(test_url, save_dir)
Code Explanation:
Path Handling: os.makedirs automatically creates a missing save directory, so you don't have to create it by hand.
Parameter Configuration: for maximum compatibility with old you-get versions, the command keeps only -o (short for --output-dir), which sets the save path. Newer versions also let you pick a specific stream (see section 2.3).
Exception Handling: try-except captures errors during the download process (e.g., invalid URL, network interruption), preventing the program from crashing.
Execution Effect:
Because subprocess captures you-get's output, the progress log (download speed, remaining time, and so on) ends up in result.stdout rather than streaming to the screen; uncomment the corresponding print to inspect it. After completion, the video file (usually in .mp4 format) will be saved in the D:/WebsiteVideos/SingleVideoTest directory.
2.2 Batch Crawling of Collection Videos (Advanced Version)
If you need to crawl a collection (like a series of courses, multi-episode documentaries), we first need to obtain the URLs of all sub-videos within the collection, then call the single video download function in a loop.
Implementation Idea:
Access the collection’s homepage (e.g., https://www.bilibili.com/video/BV1xx4y1x7xx?p=1).
Parse the HTML page to extract the sequence numbers and titles of all sub-videos in the collection.
Concatenate the complete URL for each sub-video.
Loop through and call the single video download function for batch saving.
Code Implementation:
python
import requests
from bs4 import BeautifulSoup
import os
import time
import subprocess

def get_playlist_urls(playlist_home_url):
    """
    Get the URLs of all sub-videos in a collection.
    :param playlist_home_url: Collection homepage URL (e.g., https://www.bilibili.com/video/BV1xx4y1x7xx)
    :return: List of sub-video dicts (title and link)
    """
    # 1. Set request headers to simulate browser access (avoid being identified as a crawler)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36",
        "Referer": "https://www.bilibili.com/"  # Referer check, simulating a jump from the site homepage
    }
    # 2. Send a GET request to fetch the collection homepage HTML
    try:
        response = requests.get(playlist_home_url, headers=headers, timeout=10)
        response.encoding = "utf-8"  # Set encoding to avoid garbled Chinese characters
        soup = BeautifulSoup(response.text, "lxml")  # Use lxml to parse the HTML
    except Exception as e:
        print(f"[FAIL] Failed to get collection homepage! Error: {str(e)}")
        return []
    # 3. Parse the HTML to locate the episode list.
    # At the time of writing, bilibili renders it inside a div with class "video-pod__list";
    # adjust the selector if the page structure changes.
    video_list = soup.find("div", class_="video-pod__list")
    if not video_list:
        print("[FAIL] Sub-video list not found. Possibly an incorrect URL or a changed page structure.")
        return []
    playlist = []
    # Each direct child of the list corresponds to one episode; bilibili numbers
    # episodes starting from p=1, so build the URLs from the collection URL itself.
    base_url = playlist_home_url.split("?")[0]
    episodes = video_list.find_all("div", recursive=False)
    for index, item in enumerate(episodes, start=1):
        title = item.get_text(strip=True) or f"Episode {index}"
        playlist.append({"title": title, "url": f"{base_url}?p={index}"})
    print(f"[OK] Successfully obtained collection info! Total {len(playlist)} sub-videos.")
    return playlist
def download_playlist(playlist_home_url, save_path, delay=2):
    """
    Batch download all sub-videos in a collection.
    :param playlist_home_url: Collection homepage URL
    :param save_path: Local save path
    :param delay: Delay (seconds) between downloads, to avoid being blocked for requesting too fast
    """
    # 1. Get the URLs of all sub-videos in the collection
    playlist = get_playlist_urls(playlist_home_url)
    if not playlist:
        return  # No video list obtained; exit directly
    # 2. Download each sub-video in a loop
    for index, video in enumerate(playlist, start=1):
        video_title = video["title"]
        video_url = video["url"]
        print(f"\n[DL] Starting download of video {index}/{len(playlist)}: {video_title}\n URL: {video_url}")
        # Call the single video download function
        download_single_video(video_url, save_path)
        # Wait a while to avoid triggering the site's anti-crawling mechanism (key!)
        time.sleep(delay)
    print(f"\n[DONE] All videos downloaded! Batch save path: {save_path}")

# Include the download_single_video function defined in section 2.1 here.
def download_single_video(video_url, save_path):
    # ... (function body from section 2.1) ...
    pass
# ------------------- Test Batch Download of Collection -------------------
if __name__ == "__main__":
    # Test collection homepage URL (replace with the collection you want to download)
    playlist_url = "https://www.bilibili.com/video/BV1wD4y1o7AS"
    # Local save path (a subdirectory per collection is recommended for easy management)
    save_dir = "D:/WebsiteVideos/PythonBasicsTutorialCollection"
    # Batch download (2-second delay; adjust for your network conditions)
    download_playlist(playlist_url, save_dir, delay=2)
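Incidentally, for sites that you-get supports natively, its own --playlist option can download a whole collection without any HTML parsing on our side. A one-line command-line alternative:
bash
# Let you-get enumerate and download every episode of the collection itself
you-get --playlist -o D:/WebsiteVideos/PythonBasicsTutorialCollection https://www.bilibili.com/video/BV1wD4y1o7AS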
Key Optimization Points:
Anti-Crawling Handling:
Set User-Agent and Referer to simulate browser access and avoid direct interception by the website.
Add time.sleep(delay) to control request frequency. A delay of 2-5 seconds is recommended to prevent IP blocking.
Resolving Chinese Garbled Characters: Use response.encoding = "utf-8" to force the encoding, ensuring video titles display normally.
Fault Tolerance: Even if a sub-video download fails, the program continues to download the next one without interrupting the entire batch task.
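For extra robustness, you can also wrap download_single_video in a simple retry loop with backoff. The sketch below assumes download_single_video is tweaked to return True on success (e.g., return result.returncode == 0 at the end); retry_download and its parameters are illustrative, not part of you-get:
python
import time

def retry_download(video_url, save_path, max_retries=3, base_delay=5):
    """Retry a failed download, waiting a little longer before each new attempt."""
    for attempt in range(1, max_retries + 1):
        if download_single_video(video_url, save_path):  # assumed to return True on success
            return True
        print(f"Attempt {attempt}/{max_retries} failed; retrying...")
        time.sleep(base_delay * attempt)  # linear backoff: 5s, 10s, 15s...
    return False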
2.3 Advanced Feature: Customizing Video Quality and Format
you-get lets you choose which stream (quality/format) to download: run you-get -i first to list the streams a video offers, then pass the chosen stream's format id via --format. (Note that you-get has no --quality option; stream ids vary by site and by you-get version.)
Example: Download a Video at a Specified Quality
Modify the cmd list in the download_single_video function:
python
# Download the stream whose format id was reported by "you-get -i"
cmd = [
    "you-get",
    "-o", save_path,         # Save path
    "--format=dash-flv720",  # Example bilibili 720p stream id; use one from your own "you-get -i" output
    video_url
]
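Before picking a format id, inspect what the site actually offers; the ids you see will differ per video and per you-get version. For example:
bash
# 1. List the available streams; note the "format" field of the one you want
you-get -i https://www.bilibili.com/video/BV1GJ411x7h7
# 2. Download that specific stream (replace dash-flv720 with an id from step 1)
you-get -o D:/WebsiteVideos/SingleVideoTest --format=dash-flv720 https://www.bilibili.com/video/BV1GJ411x7h7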
III. Crawling Precautions (Must Read!)
Comply with the Website’s User Agreement:
Crawled videos should only be used for personal learning and offline viewing. They must not be used for commercial purposes or distribution.
Do not crawl paid videos or member-exclusive videos, as this may involve copyright infringement.
Prevent IP Blocking:
Do not crawl videos too frequently or in large quantities. It’s recommended not to exceed 50 videos per day.
If you encounter “Access Denied”, pause for 1-2 hours before trying again, or change your network (e.g., use mobile hotspot).
Handle Special Situations:
Some videos may not be downloadable due to copyright issues. The program will prompt an error, which is normal.
If you-get cannot parse the latest website pages, update you-get to the latest version: pip install --upgrade you-get.
Permission Issues:
It’s recommended to choose a non-system drive (like D:, E:) for the save path to avoid permission issues preventing directory creation.
Windows users should not set the path to system directories like C:/Program Files, as it may be blocked by permissions.
IV. Troubleshooting Common Issues
Q1: When executing the code, I get "you-get: error: no such option: --output-dir"?
A: This is because the you-get version is too low. Execute pip install --upgrade you-get to update to the latest version.
Q2: The downloaded video has no sound or the format cannot be played?
A: For some website videos, the audio and video are stored as separate streams. you-get merges them automatically (it relies on FFmpeg for this), but occasionally the merge may fail. You can try:
1. Making sure FFmpeg is installed and on your PATH, then updating you-get to the latest version.
2. Manually specifying the format: --format=mp4 (where the site offers an mp4 stream).
Q3: “Connection timed out” is reported?
A: This could be a network issue or the website server is temporarily unavailable. It is recommended to:
1. Check if the network is normal.
2. Increase the timeout parameter (e.g., requests.get(…, timeout=15)).
3. Try downloading again later.
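On the requests side, you can also let the library retry transient failures automatically instead of looping by hand. A minimal sketch using requests' adapter mechanism (the retry counts and status codes are illustrative):
python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 3 times on connection errors and common transient HTTP status codes
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))
response = session.get("https://www.bilibili.com/video/BV1GJ411x7h7", timeout=15)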
V. Summary
This article implements batch crawling of website videos with requests + BeautifulSoup4 + you-get, covering everything from single-video download to batch collection crawling and custom video quality, which spans most practical scenarios. The core advantages are:
Concise and easy-to-understand code, suitable for beginners.
Built-in anti-crawling handling and fault tolerance mechanisms ensure high stability.
Supports flexible configuration to meet different quality and format requirements.
If further optimization is needed, you can add features like "resume from breakpoint" (to avoid re-downloading; see the sketch below) or "multi-threaded downloading" (to improve speed). Extend it based on your own needs.
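As one example, "resume from breakpoint" at the task level can be as simple as logging finished URLs and skipping them on the next run. A minimal sketch (the downloaded.log filename is arbitrary):
python
import os

LOG_FILE = "downloaded.log"  # arbitrary filename for the completed-URL log

def already_downloaded(video_url):
    """Check whether a URL was logged as finished in a previous run."""
    if not os.path.exists(LOG_FILE):
        return False
    with open(LOG_FILE, encoding="utf-8") as f:
        return video_url in {line.strip() for line in f}

def mark_downloaded(video_url):
    """Append a finished URL to the log."""
    with open(LOG_FILE, "a", encoding="utf-8") as f:
        f.write(video_url + "\n")

In download_playlist, call already_downloaded(video_url) before each download and mark_downloaded(video_url) after a success.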
Final Reminder: Web scraping should only be used for personal learning. Comply with platform rules and respect copyright!