Introduction
In the previous session, we discussed how Pandas can conveniently read and store table-type files to enable interaction between local storage and Python.
However, in many scenarios, the data we need may be distributed across multiple different files or DataFrames. To integrate them into a complete dataset, we need to use concatenation as mentioned in this session’s title.
Pandas provides several concatenation methods, mainly including concat, merge, join, etc. Each has its applicable scenarios. Below, we introduce them.
concat
The
concatfunction is primarily used to concatenate multiple DataFrame or Series objects along a specified axis (row or column). Commonly used custom parameters include:
axis: Specifies the concatenation axis.0for rows,1for columns. Default is0.join: Specifies the concatenation method.'inner'retains only common rows or columns;'outer'retains all rows or columns. Default is'outer'.ignore_index: Specifies whether to reset the index. Default isFalse.
Let’s look at examples directly:
python
import numpy as np
import pandas as pd
# Create a dataset
data0 = pd.DataFrame(np.random.rand(5) * 10, index=pd.date_range('2000-01-01', periods=5, freq='Y'), columns=['temperature'])
data1 = pd.DataFrame(np.random.rand(5) * 1000, index=pd.date_range('2000-01-01', periods=5, freq='Y'), columns=['precipitation'])
print(data0)
print(data1)
# Concatenate datasets by column (axis=1)
data = pd.concat([data0, data1], axis=1)
print(data)
# Output (example):
# temperature
# 2000-12-31 2.956534
# 2001-12-31 8.961623
# 2002-12-31 0.720613
# 2003-12-31 4.456005
# 2004-12-31 0.500873
# precipitation
# 2000-12-31 44.627392
# 2001-12-31 804.956900
# 2002-12-31 582.525174
# 2003-12-31 628.544335
# 2004-12-31 100.806668
# temperature precipitation
# 2000-12-31 2.956534 44.627392
# 2001-12-31 8.961623 804.956900
# 2002-12-31 0.720613 582.525174
# 2003-12-31 4.456005 628.544335
# 2004-12-31 0.500873 100.806668
python
# Create another dataset
data0 = pd.DataFrame(np.random.rand(5) * 10, index=pd.date_range('2000-01-01', periods=5, freq='Y'), columns=['temperature'])
data1 = pd.DataFrame(np.random.rand(5) * 1000, index=pd.date_range('2005-01-01', periods=5, freq='Y'), columns=['temperature'])
print(data0)
print(data1)
# Concatenate datasets by row (axis=0)
data = pd.concat([data0, data1])
print(data)
# Output (example):
# temperature
# 2000-12-31 9.371567
# 2001-12-31 9.577373
# 2002-12-31 9.100010
# 2003-12-31 1.377707
# 2004-12-31 2.171535
# temperature
# 2005-12-31 182.571988
# 2006-12-31 623.641092
# 2007-12-31 865.181408
# 2008-12-31 333.438249
# 2009-12-31 780.453570
# temperature
# 2000-12-31 9.371567
# 2001-12-31 9.577373
# 2002-12-31 9.100010
# 2003-12-31 1.377707
# 2004-12-31 2.171535
# 2005-12-31 182.571988
# 2006-12-31 623.641092
# 2007-12-31 865.181408
# 2008-12-31 333.438249
# 2009-12-31 780.453570
python
# Create a more complex dataset
data0 = pd.DataFrame(np.random.rand(5, 3) * 10, index=pd.date_range('2000-01-01', periods=5, freq='Y'), columns=['temperature', 'pressure', 'humidity'])
data1 = pd.DataFrame(np.random.rand(5, 4) * 10, index=pd.date_range('2001-01-01', periods=5, freq='Y'), columns=['temperature', 'precipitation', 'humidity', 'wind'])
print(data0)
print(data1)
# The default concatenation method is 'outer', taking the union of both datasets.
data = pd.concat([data0, data1], axis=1)
print(data)
# Output (example):
# temperature pressure humidity
# 2000-12-31 1.001251 2.545684 9.666708
# 2001-12-31 2.902656 9.816877 7.251775
# 2002-12-31 3.361788 8.067900 3.038989
# 2003-12-31 9.847927 4.671510 9.594601
# 2004-12-31 5.341185 3.827158 4.114981
# temperature precipitation humidity wind
# 2001-12-31 3.272974 4.891393 4.228730 1.531332
# 2002-12-31 4.177929 5.301486 5.563462 6.474081
# 2003-12-31 8.423616 6.799374 9.836578 9.643460
# 2004-12-31 7.839116 6.346595 2.614979 9.741067
# 2005-12-31 8.017644 7.170777 4.284372 0.935709
# temperature pressure humidity temperature precipitation \
# 2000-12-31 1.001251 2.545684 9.666708 NaN NaN
# 2001-12-31 2.902656 9.816877 7.251775 3.272974 4.891393
# 2002-12-31 3.361788 8.067900 3.038989 4.177929 5.301486
# 2003-12-31 9.847927 4.671510 9.594601 8.423616 6.799374
# 2004-12-31 5.341185 3.827158 4.114981 7.839116 6.346595
# 2005-12-31 NaN NaN NaN 8.017644 7.170777
#
# humidity wind
# 2000-12-31 NaN NaN
# 2001-12-31 4.228730 1.531332
# 2002-12-31 5.563462 6.474081
# 2003-12-31 9.836578 9.643460
# 2004-12-31 2.614979 9.741067
# 2005-12-31 4.284372 0.935709
We can note that when dealing with a more complex dataset, more settings are needed to achieve the desired result. The default concatenation method for concat is 'outer', taking the union of both datasets.
Additionally, for the dimension not being concatenated, if data at the corresponding position doesn’t exist, it’s filled with NaN (here, since we concatenate by column, values for corresponding row indices that don’t exist are filled with NaN).
We can also keep only the columns that exist in both datasets (intersection):
python
# Concatenation method is 'inner', taking the intersection of both datasets. data = pd.concat([data0, data1], axis=1, join='inner') # Keep only common rows. print(data) # Note: Since we are concatenating by column, the intersection applies to row indices. # Let's try the effect of different axis concatenation in this example. data = pd.concat([data0, data1], axis=0, join='inner') # Keep only common columns. print(data) # Output (example): # (axis=1, join='inner') # temperature pressure humidity temperature precipitation \ # 2001-12-31 2.902656 9.816877 7.251775 3.272974 4.891393 # 2002-12-31 3.361788 8.067900 3.038989 4.177929 5.301486 # 2003-12-31 9.847927 4.671510 9.594601 8.423616 6.799374 # 2004-12-31 5.341185 3.827158 4.114981 7.839116 6.346595 # # humidity wind # 2001-12-31 4.228730 1.531332 # 2002-12-31 5.563462 6.474081 # 2003-12-31 9.836578 9.643460 # 2004-12-31 2.614979 9.741067 # (axis=0, join='inner') # temperature humidity # 2000-12-31 1.001251 9.666708 # 2001-12-31 2.902656 7.251775 # 2002-12-31 3.361788 3.038989 # 2003-12-31 9.847927 9.594601 # 2004-12-31 5.341185 4.114981 # 2001-12-31 3.272974 4.228730 # 2002-12-31 4.177929 5.563462 # 2003-12-31 8.423616 9.836578 # 2004-12-31 7.839116 2.614979 # 2005-12-31 8.017644 4.284372
Finally, in the last output above, duplicate row indices will cause trouble during indexing (try running data.loc['2001-12-31', :]).
If row indices aren’t significant, we can simply reset them:
python
data = pd.concat([data0, data1], axis=0, ignore_index=True) print(data) # Output (example): # temperature pressure humidity precipitation wind # 0 1.001251 2.545684 9.666708 NaN NaN # 1 2.902656 9.816877 7.251775 NaN NaN # 2 3.361788 8.067900 3.038989 NaN NaN # 3 9.847927 4.671510 9.594601 NaN NaN # 4 5.341185 3.827158 4.114981 NaN NaN # 5 3.272974 NaN 4.228730 4.891393 1.531332 # 6 4.177929 NaN 5.563462 5.301486 6.474081 # 7 8.423616 NaN 9.836578 6.799374 9.643460 # 8 7.839116 NaN 2.614979 6.346595 9.741067 # 9 8.017644 NaN 4.284372 7.170777 0.935709
merge
The merge function is used to combine DataFrames based on one or more keys.
Unlike concat, which concatenates along an axis, merge is more akin to a relational database join operation.
It associates rows from two DataFrames based on specified keys, similar to the pivot functionality in Excel.
Main parameters include:
left: The first DataFrame.right: The second DataFrame.on: The join key(s), column name(s) existing in both DataFrames.left_on: Join key(s) from the left DataFrame.right_on: Join key(s) from the right DataFrame.left_index: Whether to use the left DataFrame’s index as the join key.right_index: Whether to use the right DataFrame’s index as the join key.how: Join method. Options are'inner'(inner join),'outer'(outer join),'left'(left join),'right'(right join).suffixes: Suffixes to add when join column names overlap.
Let’s demonstrate using the random data above:
python
left = data0.copy() right = data1.copy() df_merge = pd.merge(left, right, on='temperature', how='inner') print(df_merge) # Output: # Empty DataFrame # Columns: [temperature, pressure, humidity_x, precipitation, humidity_y, wind] # Index: []
Unlike concat, which matches based on column/row names, merge requires identical values to join two datasets. Since our specified temperature random numbers have no duplicates, the result is empty.
Let’s modify it slightly:
python
right['temperature'] = left['temperature'] # Copy temperature column to ensure matches. df_merge = pd.merge(left, right, on='temperature', how='inner') print(df_merge) # Output (example): # temperature pressure humidity_x precipitation humidity_y wind # 0 2.902656 9.816877 7.251775 4.891393 4.228730 1.531332 # 1 3.361788 8.067900 3.038989 5.301486 5.563462 6.474081 # 2 9.847927 4.671510 9.594601 6.799374 9.836578 9.643460 # 3 5.341185 3.827158 4.114981 6.346595 2.614979 9.741067
Of course, we can also try other join methods:
python
# Join method 'outer', taking the union. df_merge = pd.merge(left, right, on='temperature', how='outer') print(df_merge) # Join method 'left', keeping only rows corresponding to keys in the left table. df_merge = pd.merge(left, right, on='temperature', how='left') print(df_merge) # Join method 'right', keeping only rows corresponding to keys in the right table. df_merge = pd.merge(left, right, on='temperature', how='right') print(df_merge) # Output (example - outer): # temperature pressure humidity_x precipitation humidity_y wind # 0 1.001251 2.545684 9.666708 NaN NaN NaN # 1 2.902656 9.816877 7.251775 4.891393 4.228730 1.531332 # 2 3.361788 8.067900 3.038989 5.301486 5.563462 6.474081 # 3 9.847927 4.671510 9.594601 6.799374 9.836578 9.643460 # 4 5.341185 3.827158 4.114981 6.346595 2.614979 9.741067 # 5 NaN NaN NaN 7.170777 4.284372 0.935709 # ... (left and right similar pattern) ...
Or directly use index values (row labels) for joining:
python
# Merge tables based on row index.
# Briefly try the effect of suffixes.
df_merge = pd.merge(left, right, left_index=True, right_index=True, how='outer', suffixes=('_L', '_R'))
print(df_merge)
# Output (example):
# temperature_L pressure humidity_L temperature_R precipitation \
# 2000-12-31 1.001251 2.545684 9.666708 NaN NaN
# 2001-12-31 2.902656 9.816877 7.251775 2.902656 4.891393
# 2002-12-31 3.361788 8.067900 3.038989 3.361788 5.301486
# 2003-12-31 9.847927 4.671510 9.594601 9.847927 6.799374
# 2004-12-31 5.341185 3.827158 4.114981 5.341185 6.346595
# 2005-12-31 NaN NaN NaN NaN 7.170777
#
# humidity_R wind
# 2000-12-31 NaN NaN
# 2001-12-31 4.228730 1.531332
# 2002-12-31 5.563462 6.474081
# 2003-12-31 9.836578 9.643460
# 2004-12-31 2.614979 9.741067
# 2005-12-31 4.284372 0.935709
In some scenarios, column names for corresponding data in two tables differ, or we need to join based on multiple columns. Then we need to specify different column names for indexing:
python
right['tas'] = left['temperature'] # Rename columns for demonstration.
right['rh'] = left['humidity']
df_merge = pd.merge(left, right, left_on=['temperature', 'humidity'], right_on=['tas', 'rh'], how='inner', suffixes=('_α', '_β'))
print(df_merge)
# Output (example):
# temperature_α pressure humidity_α temperature_β precipitation \
# 0 2.902656 9.816877 7.251775 2.902656 4.891393
# 1 3.361788 8.067900 3.038989 3.361788 5.301486
# 2 9.847927 4.671510 9.594601 9.847927 6.799374
# 3 5.341185 3.827158 4.114981 5.341185 6.346595
#
# humidity_β wind tas rh
# 0 4.228730 1.531332 2.902656 7.251775
# 1 5.563462 6.474081 3.361788 3.038989
# 2 9.836578 9.643460 9.847927 9.594601
# 3 2.614979 9.741067 5.341185 4.114981
join
This function is similar to a combination of the above two. It merges two DataFrame objects based on their index (row labels).
Its usage is akin to pd.merge(left, right, left_index=True, right_index=True, suffixes=('_L', '_R')).
Let’s directly see the effect of executing such statements:
python
left = data0.copy() right = data1.copy() df_join = left.join(right, how='outer', lsuffix='_L', rsuffix='_R') print(df_join) df_join = left.join(right, how='inner', lsuffix='_L', rsuffix='_R') print(df_join) df_join = left.join(right, how='left', lsuffix='_L', rsuffix='_R') print(df_join) df_join = left.join(right, how='right', lsuffix='_L', rsuffix='_R') print(df_join) # Output (example - outer): # temperature_L pressure humidity_L temperature_R precipitation \ # 2000-12-31 1.001251 2.545684 9.666708 NaN NaN # 2001-12-31 2.902656 9.816877 7.251775 3.272974 4.891393 # 2002-12-31 3.361788 8.067900 3.038989 4.177929 5.301486 # 2003-12-31 9.847927 4.671510 9.594601 8.423616 6.799374 # 2004-12-31 5.341185 3.827158 4.114981 7.839116 6.346595 # 2005-12-31 NaN NaN NaN 8.017644 7.170777 # # humidity_R wind # 2000-12-31 NaN NaN # 2001-12-31 4.228730 1.531332 # 2002-12-31 5.563462 6.474081 # 2003-12-31 9.836578 9.643460 # 2004-12-31 2.614979 9.741067 # 2005-12-31 4.284372 0.935709 # ... (inner, left, right similar) ...
Postscript
The above covers the basic content of concatenating data using Pandas. As for its practical application, the most straightforward example might be batch reading meteorological data stored on a yearly basis and merging them into a complete time series for subsequent processing.