Just started learning Python for data analysis, feeling overwhelmed when Excel chokes on millions of rows? Spending late nights manually cleaning data? Don’t worry! This practical tutorial for beginners delivers 10 game-changing techniques covering the entire data processing workflow. Each technique comes with a “real-world use case + copy-paste-ready code”. No need to understand complex theory—just follow along to handle tens of millions of rows and instantly level up your data analysis skills!
Pandas is Python’s ultimate data analysis tool. It can easily replace Excel’s filtering, pivot tables, data merging, and more, while handling massive datasets that Excel can’t cope with. As a beginner, you don’t need to learn everything. Master these 10 core techniques, and you’ll handle 80% of real-world tasks!
1. First, Understand Pandas’ Two Core Structures (The Absolute Basics)
Before learning Pandas, remember these two “data containers”: Series (one-dimensional data, like a single-column table) and DataFrame (two-dimensional data, like an Excel spreadsheet). All operations revolve around them.
Use Case: Storing simple data (like grades, employee info).
```python
import pandas as pd # Import the Pandas library, conventionally abbreviated as 'pd'
# 1. Create a Series (1D data: e.g., a student's grades in three subjects)
score_series = pd.Series([88, 92, 78], index=['Math', 'English', 'Chinese'])
print("=== Student Grades (Series) ===")
print(score_series)
# Output:
# Math 88
# English 92
# Chinese 78
# dtype: int64
# 2. Create a DataFrame (2D data: e.g., basic info for 3 employees)
employee_df = pd.DataFrame({
'Name': ['Zhang San', 'Li Si', 'Wang Wu'],
'Age': [22, 25, 30],
'Department': ['Tech', 'Sales', 'Marketing'],
'Salary': [15000, 23000, 35000]
})
print("\n=== Employee Info Table (DataFrame) ===")
print(employee_df)
# The output looks just like an Excel table, with rows and columns, easy to understand.
```
Beginner Notes:
- A Series is like a “labeled array,” where each value has a corresponding index (e.g., subject names).
- A DataFrame is “tabular data,” with column names (like Name, Age) and row indices (default: 0, 1, 2…). It’s the most commonly used structure in real work. The snippet below shows how to read values out of both by label.
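As a quick taste of how those labels work, here is a minimal sketch that reads values back out of the two objects created above:

```python
# Series: look up a value by its index label, like a dict.
print(score_series['Math'])          # 88
# DataFrame: select a whole column by name (returns a Series).
print(employee_df['Age'])
# DataFrame: select a single cell by row label and column name.
print(employee_df.loc[0, 'Salary'])  # 15000
```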
2. Data Import/Export: Handle 15+ File Formats with 1 Line of Code
In real work, data is often stored in Excel, CSV, or JSON files. Pandas can read and save them in one line—no more manual copy-pasting.
Use Case 1: Read sales data in CSV format.
```python
# Read a local CSV file (e.g., e-commerce sales data)
# `encoding='gbk'` resolves Chinese character issues (commonly used on Windows).
sales_df = pd.read_csv('ecommerce_sales.csv', encoding='gbk')
print("=== Read Sales Data ===")
print(sales_df.head()) # head() shows only the first 5 rows to avoid flooding the screen.
```
Use Case 2: Read an Excel file and save a new file.
```python
# Read the Q1 financial sheet from an Excel file (specifying the sheet name).
finance_q1 = pd.read_excel('company_finance.xlsx', sheet_name='Q1')
# After processing, save as a new Excel file (`index=False` prevents saving row indices for a cleaner look).
finance_q1.to_excel('cleaned_Q1_finance.xlsx', index=False)
print("Excel file saved successfully!")
Use Case 3: Read user data in JSON format.
```python
# Read a JSON string (e.g., data from an API response).
# Recent pandas versions expect a file-like object here, so wrap the literal
# string in StringIO to avoid a FutureWarning.
from io import StringIO
json_data = '{"name":["Lisa","Mike","Tom"], "score":[95,87,92], "gender":["F","M","M"]}'
user_df = pd.read_json(StringIO(json_data))
print("\n=== User Data from JSON ===")
print(user_df)
```
Common Import/Export Summary:
| Operation | Code Example | Use Case |
|---|---|---|
| Read CSV | pd.read_csv('file_path.csv', encoding='gbk') | E-commerce sales, user behavior data |
| Read Excel | pd.read_excel('file_path.xlsx', sheet_name='Sheet1') | Financial reports, employee attendance |
| Save Excel | df.to_excel('save_path.xlsx', index=False) | Sharing analysis results with colleagues |
| Read JSON | pd.read_json('file_path_or_string') | Data returned from an API |
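One everyday operation the table skips is saving back to CSV. A one-line sketch (the file name is just a placeholder; `utf-8-sig` keeps Chinese text readable when the file is opened in Excel):

```python
# Save a DataFrame as CSV; `index=False` drops the row index.
sales_df.to_csv('cleaned_sales.csv', index=False, encoding='utf-8-sig')
```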
3. Data Cleaning in Four Steps: Tame Your “Dirty Data” (High Frequency in Practice)
Real-world data often has missing values, duplicates, and outliers (e.g., an unrealistic salary of 1,000,000). These four steps clean it up fast.
Use Case: Cleaning a messy employee info table.
```python
# First, create a messy employee table (simulating a real scenario).
dirty_employee_df = pd.DataFrame({
'Name': ['Zhang San', 'Li Si', 'Wang Wu', 'Zhang San', 'Zhao Liu'],
'Age': [22, None, 30, 22, 100], # None represents a missing value.
'Department': ['Tech', 'Sales', None, 'Tech', 'Marketing'],
'Salary': [15000, 23000, 35000, 15000, 500000] # 500000 is an outlier.
})
print("=== Dirty Data Before Cleaning ===")
print(dirty_employee_df)
# 1. Handle Missing Values (Two common ways)
# Method 1: Fill missing ages with the mean age.
# (Assign back rather than using `inplace=True` on the column, which is chained assignment.)
dirty_employee_df['Age'] = dirty_employee_df['Age'].fillna(dirty_employee_df['Age'].mean())
# Method 2: Drop rows missing the 'Department' value (critical info).
dirty_employee_df.dropna(subset=['Department'], inplace=True)
# 2. Remove Duplicate Rows
# Consider rows with duplicate 'Name' as duplicates, keep the first occurrence.
dirty_employee_df.drop_duplicates(subset=['Name'], keep='first', inplace=True)
# 3. Handle Outliers (Assume reasonable salary: 5000-100000, delete others).
dirty_employee_df = dirty_employee_df[(dirty_employee_df['Salary'] > 5000) & (dirty_employee_df['Salary'] < 100000)]
# 4. Type Conversion (Convert 'Age' from float to int, more appropriate).
dirty_employee_df['Age'] = dirty_employee_df['Age'].astype(int)
print("\n=== Clean Data After Cleaning ===")
print(dirty_employee_df)
```
Beginner Notes:
- `inplace=True` modifies the original DataFrame directly, no need to reassign.
- Missing Values: Fill numerical data with the mean/median. Fill categorical data (like Department) with the mode, or drop the rows.
- Outliers: First define a reasonable range (e.g., Age 18-60), then filter out anything outside it. Both patterns are shown in the sketch below.
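A minimal sketch of the mode-fill and range-filter patterns, reusing the `dirty_employee_df` columns from above (filling Department is an alternative to the `dropna` used earlier):

```python
# Alternative to dropping: fill missing departments with the most frequent value.
dirty_employee_df['Department'] = dirty_employee_df['Department'].fillna(
    dirty_employee_df['Department'].mode()[0]
)
# Outlier filter: keep only ages inside a plausible range.
dirty_employee_df = dirty_employee_df[dirty_employee_df['Age'].between(18, 60)]
```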
4. Data Selection: Pinpoint Exactly What You Need (As Easy as Excel Filtering)
Pandas’ loc and iloc are the “magic tools” for data selection. loc selects by label (column/row names), iloc selects by integer position. More flexible than Excel.
Use Case: Selecting specific data from the employee table.
```python
# Use the clean employee DataFrame from above.
clean_employee_df = dirty_employee_df.copy() # Continue from the cleaned result (a copy, so later edits don't touch the original).
# 1. Conditional Filtering: Find employees with Salary > 20000.
high_salary_df = clean_employee_df[clean_employee_df['Salary'] > 20000]
print("=== Employees with Salary > 20000 ===")
print(high_salary_df)
# 2. Multiple Condition Filtering: Tech Dept employees under 30.
tech_young_df = clean_employee_df[(clean_employee_df['Department'] == 'Tech') & (clean_employee_df['Age'] < 30)]
print("\n=== Tech Dept Employees Under 30 ===")
print(tech_young_df)
# 3. Label-based Selection (`loc`: uses row index labels + column names).
# Select rows with labels 0 to 1 (`loc` slices include the end label), only 'Name' and 'Salary' columns.
loc_df = clean_employee_df.loc[0:1, ['Name', 'Salary']]
print("\n=== `loc` Selection Result ===")
print(loc_df)
# 4. Integer Position-based Selection (`iloc`: uses row/column integer positions).
# Select rows at positions 0 and 1, columns 1 to 2 (positions: 1=Age, 2=Department).
# (Only two rows survive the cleaning above, so position 2 would be out of bounds.)
iloc_df = clean_employee_df.iloc[[0, 1], 1:3]
print("\n=== `iloc` Selection Result ===")
print(iloc_df)
```
Memory Tip:
- `loc` = location, selects by name (e.g., column name 'Salary', row index label 0).
- `iloc` = integer location, selects by number (e.g., 1st column, 2nd row). The sketch below shows why the distinction matters.
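Here is a minimal sketch of that difference, using a hypothetical toy table whose row index is names rather than numbers:

```python
df = pd.DataFrame(
    {'Salary': [15000, 23000, 35000]},
    index=['Zhang San', 'Li Si', 'Wang Wu']  # String row labels instead of 0, 1, 2.
)
print(df.loc['Li Si', 'Salary'])  # By label -> 23000
print(df.iloc[1, 0])              # By position (2nd row, 1st column) -> 23000
```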
5. Grouping & Aggregation: 1 Line of Code to Replace Excel Pivot Tables
Need the average salary per department or total sales per category? Use groupby. It’s more efficient than an Excel pivot table and far easier to apply in batch.
Use Case: Salary statistics per department.
```python
# Continue using the employee DataFrame.
print("=== Salary Statistics by Department ===")
# Group by 'Department', aggregate Salary (mean, max, min) and Age (median).
dept_salary_stats = clean_employee_df.groupby('Department').agg({
'Salary': ['mean', 'max', 'min'], # Three salary metrics.
'Age': 'median' # Median age.
})
print(dept_salary_stats)
# Simplify column names (beginner-friendly, avoids multi-level columns).
dept_salary_stats.columns = ['Avg Salary', 'Max Salary', 'Min Salary', 'Median Age']
dept_salary_stats.reset_index(inplace=True) # Move 'Department' from index back to a column.
print("\n=== Simplified Statistics ===")
print(dept_salary_stats)
```
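If the multi-level columns feel awkward, pandas’ named aggregation (available since 0.25) produces flat, readable column names in one step. This sketch is equivalent to the two-step version above:

```python
# Named aggregation: each keyword becomes a flat output column name.
dept_salary_stats = clean_employee_df.groupby('Department', as_index=False).agg(
    AvgSalary=('Salary', 'mean'),
    MaxSalary=('Salary', 'max'),
    MinSalary=('Salary', 'min'),
    MedianAge=('Age', 'median'),
)
print(dept_salary_stats)
```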
Extended Use Case: Top-selling e-commerce products.
```python
# Read e-commerce sales data.
sales_df = pd.read_csv('ecommerce_sales.csv', encoding='gbk')
# Group by 'Product Name', calculate total sales (Price * Quantity).
sales_df['Revenue'] = sales_df['Price'] * sales_df['Quantity']
top_product = sales_df.groupby('Product Name')['Revenue'].sum().sort_values(ascending=False).head(10)
print("\n=== Top 10 Best-Selling Products ===")
print(top_product)
```
6. Data Merging: Combining Multiple Tables (As Powerful as SQL JOINs)
Real-world data often sits in multiple tables (e.g., employee table, department table). Use concat and merge to quickly combine them.
Use Case 1: Vertical Concatenation (Same structure, e.g., Q1 & Q2 sales data).
```python
# Simulate Q1 and Q2 sales data (same structure).
q1_sales = pd.DataFrame({
'Date': ['2024-01-01', '2024-02-01'],
'Product': ['A', 'B'],
'Quantity': [100, 200]
})
q2_sales = pd.DataFrame({
'Date': ['2024-04-01', '2024-05-01'],
'Product': ['C', 'D'],
'Quantity': [150, 250]
})
# Concatenate vertically (`ignore_index=True` resets the row index).
all_sales = pd.concat([q1_sales, q2_sales], ignore_index=True)
print("=== Concatenated Full-Year Sales Data ===")
print(all_sales)
```
Use Case 2: Horizontal Merging (Joining different tables, e.g., employee and department tables).
```python
# Employee Table (has DepartmentID).
employee_df = pd.DataFrame({
'EmployeeID': [1, 2, 3],
'Name': ['Zhang San', 'Li Si', 'Wang Wu'],
'DepartmentID': [101, 102, 101]
})
# Department Table (has DepartmentID and Department Name).
dept_df = pd.DataFrame({
'DepartmentID': [101, 102],
'DepartmentName': ['Tech', 'Sales'],
'Manager': ['Mr. Zhao', 'Ms. Sun']
})
# Join the two tables on 'DepartmentID' (left join keeps all employee data).
merged_df = pd.merge(
left=employee_df,
right=dept_df,
how='left', # Left join: keep all left table data, NaN for unmatched right rows.
on='DepartmentID' # Key field for the join.
)
print("\n=== Merged Employee-Department Table ===")
print(merged_df)
```
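The `how` argument decides which rows survive the merge. A quick sketch on the same two toy tables, for comparison with the left join above:

```python
# Inner join: keep only DepartmentIDs present in BOTH tables.
inner_df = pd.merge(employee_df, dept_df, how='inner', on='DepartmentID')
# Outer join: keep every row from both tables, filling gaps with NaN.
outer_df = pd.merge(employee_df, dept_df, how='outer', on='DepartmentID')
print(inner_df)
print(outer_df)
```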
7. Time Series Processing: Conquer Date-Related Data (e.g., Sales Trend Analysis)
Pandas handles date data superbly. Quickly aggregate data by day, month, or quarter—perfect for trend analysis.
Use Case: Analyzing monthly sales trends.
```python
# Simulate 3 months of sales data (with inconsistent date formats).
sales_date_df = pd.DataFrame({
'Date': ['2024/01/05', '2024-02-10', '2024.03.15', '2024/01/20'],
'Revenue': [15000, 23000, 18000, 20000]
})
# 1. Convert date format (unify to datetime type).
sales_date_df['Date'] = pd.to_datetime(sales_date_df['Date'], format='mixed')
# 2. Group by "Month" and sum Revenue.
monthly_sales = sales_date_df.groupby(sales_date_df['Date'].dt.to_period('M'))['Revenue'].sum()
print("=== Monthly Revenue Statistics ===")
print(monthly_sales)
# 3. Date arithmetic (e.g., compute the date 7 days earlier).
# Note: `shift(7)` would move values down 7 *rows*, not 7 days;
# subtract a Timedelta to shift each date back 7 days.
sales_date_df['Date_7_days_ago'] = sales_date_df['Date'] - pd.Timedelta(days=7)
print("\n=== Data with Dates 7 Days Earlier ===")
print(sales_date_df)
```
Beginner Notes:
- `pd.to_datetime()` automatically recognizes many date formats (like 2024/01/05 or 2024-02-10), which is incredibly useful!
- `dt.to_period('M')` groups by month; similarly, 'Y' is for year and 'D' for day. The sketch below builds on this for month-over-month comparison.
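Where `shift()` genuinely shines is period-over-period comparison: it slides a column down one row, so each period can see the previous period’s value. A minimal sketch building on the `monthly_sales` series computed above:

```python
# Compare each month's revenue with the previous month's.
trend_df = monthly_sales.to_frame(name='Revenue')
trend_df['PrevMonth'] = trend_df['Revenue'].shift(1)                  # Previous month's revenue.
trend_df['MoM_Change'] = trend_df['Revenue'] - trend_df['PrevMonth']  # Month-over-month delta.
print(trend_df)
```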
8. Performance Optimization: Handle Tens of Millions of Rows Without Lag
If your dataset is huge (e.g., 10 million rows), direct processing may lag. These three techniques significantly boost efficiency.
Use Case: Processing a large CSV file (tens of millions of rows).
```python
# 1. Use efficient data types (reduce memory usage).
# E.g., downcast 'Age' from the default int64 to int32, halving that
# column's memory footprint ('category' or smaller int types save even more).
clean_employee_df['Age'] = clean_employee_df['Age'].astype('int32')
# 2. Read large data in chunks (avoid loading everything into memory).
# `chunksize=100000` reads 100,000 rows at a time.
chunk_iter = pd.read_csv('massive_sales_data.csv', chunksize=100000, encoding='gbk')
for chunk in chunk_iter:
# Process each chunk (e.g., sum the quantity).
chunk_sales = chunk['Quantity'].sum()
print(f"Current chunk total quantity: {chunk_sales}")
# 3. Use vectorized operations instead of loops (often 10 to 100x faster).
# Bad way: Use a for loop to calculate bonus (Salary * 0.1).
# Good way: Vectorized operation.
clean_employee_df['Bonus'] = clean_employee_df['Salary'] * 0.1
print("\n=== Data with Calculated Bonus ===")
print(clean_employee_df[['Name', 'Salary', 'Bonus']])
```
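One detail worth spelling out: the chunked loop above only prints per-chunk subtotals. To get a figure for the whole file, accumulate across chunks (same hypothetical file as above):

```python
# Sum a column across a file too large to load at once.
total_quantity = 0
for chunk in pd.read_csv('massive_sales_data.csv', chunksize=100000, encoding='gbk'):
    total_quantity += chunk['Quantity'].sum()  # Add each chunk's subtotal.
print(f"Grand total quantity: {total_quantity}")
```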
9. 3 Common Beginner Pitfalls You MUST Avoid
- Chained Assignment: Don’t write `df[df.Age>30]['Salary'] = 0`. It assigns into a temporary copy, so the original DataFrame is silently left unchanged (pandas warns about this). Correct way: `df.loc[df.Age>30, 'Salary'] = 0`.
- SettingWithCopyWarning: When this warning appears, you’re likely modifying a copy of the data. Use `copy()` to explicitly create a copy: `new_df = df[df.Department=='Tech'].copy()`.
- Unconverted Dates: Grouping or sorting by string-formatted dates gives wrong results, because strings compare alphabetically rather than chronologically. Always convert with `pd.to_datetime()` first. The sketch below demonstrates all three fixes.
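A minimal runnable sketch of all three fixes on a toy DataFrame:

```python
df = pd.DataFrame({'Age': [25, 35], 'Department': ['Tech', 'Sales'],
                   'Salary': [15000, 23000], 'Date': ['2024-01-05', '2024-02-10']})
# Fix 1: assign through a single loc call instead of chained indexing.
df.loc[df.Age > 30, 'Salary'] = 0
# Fix 2: take an explicit copy before modifying a filtered subset.
tech_df = df[df.Department == 'Tech'].copy()
tech_df['Bonus'] = tech_df['Salary'] * 0.1
# Fix 3: convert strings to real datetimes before any date logic.
df['Date'] = pd.to_datetime(df['Date'])
print(df)
```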
10. Complete Practical Case: E-commerce Sales Data Analysis
Combining the techniques above, let’s fully analyze e-commerce sales data to see Pandas in action.
```python
# 1. Read Data
orders_df = pd.read_csv('orders.csv', encoding='gbk') # Order data.
products_df = pd.read_csv('products.csv', encoding='gbk') # Product data.
# 2. Data Cleaning
orders_df.dropna(subset=['UserID', 'ProductID'], inplace=True) # Drop rows missing key fields.
orders_df.drop_duplicates(inplace=True) # Remove duplicate orders.
# 3. Data Merging (Join orders and products tables).
full_df = pd.merge(orders_df, products_df, on='ProductID', how='left')
# 4. Calculate Key Metrics
full_df['Revenue'] = full_df['Price'] * full_df['Quantity'] # Calculate revenue.
# 5. Analyze Top 10 Best-Selling Products.
top10_products = full_df.groupby('ProductName')['Revenue'].sum().sort_values(ascending=False).head(10)
print("=== Top 10 Best-Selling Products ===")
print(top10_products)
# 6. Analyze Monthly Sales Trend
full_df['OrderDate'] = pd.to_datetime(full_df['OrderDate'], format='mixed')
monthly_trend = full_df.groupby(full_df['OrderDate'].dt.to_period('M'))['Revenue'].sum()
print("\n=== Monthly Sales Trend ===")
print(monthly_trend)
# 7. Save Analysis Results to Excel
with pd.ExcelWriter('ecommerce_analysis_report.xlsx') as writer:
top10_products.to_excel(writer, sheet_name='Top10 Products')
monthly_trend.to_excel(writer, sheet_name='Monthly Trend')
print("\nAnalysis report saved as an Excel file!")
Summary
The essence of Pandas is “doing the most complex data processing with the simplest code.” As a beginner, don’t memorize everything. First, master these 10 techniques. When you encounter real problems, refer to the relevant use case and code. You’ll get the hang of it in no time!