Pandas 05 | How to Use Pandas to Concatenate Data?

Introduction

In the previous session, we discussed how Pandas can conveniently read and store table-type files to enable interaction between local storage and Python.

However, in many scenarios, the data we need may be distributed across multiple different files or DataFrames. To integrate them into a complete dataset, we need to use concatenation as mentioned in this session’s title.

Pandas provides several concatenation methods, mainly including concat, merge, join, etc. Each has its applicable scenarios. Below, we introduce them.

concat

The concat function is primarily used to concatenate multiple DataFrame or Series objects along a specified axis (row or column). Commonly used custom parameters include:

axis: Specifies the concatenation axis. 0 for rows, 1 for columns. Default is 0.

join: Specifies the concatenation method. 'inner' retains only common rows or columns; 'outer' retains all rows or columns. Default is 'outer'.

ignore_index: Specifies whether to reset the index. Default is False.

Let’s look at examples directly:

python

import numpy as np
import pandas as pd

# Create a dataset
data0 = pd.DataFrame(np.random.rand(5) * 10, index=pd.date_range('2000-01-01', periods=5, freq='Y'), columns=['temperature'])
data1 = pd.DataFrame(np.random.rand(5) * 1000, index=pd.date_range('2000-01-01', periods=5, freq='Y'), columns=['precipitation'])

print(data0)
print(data1)

# Concatenate datasets by column (axis=1)
data = pd.concat([data0, data1], axis=1)
print(data)

# Output (example):
#               temperature
# 2000-12-31     2.956534
# 2001-12-31     8.961623
# 2002-12-31     0.720613
# 2003-12-31     4.456005
# 2004-12-31     0.500873
#               precipitation
# 2000-12-31      44.627392
# 2001-12-31     804.956900
# 2002-12-31     582.525174
# 2003-12-31     628.544335
# 2004-12-31     100.806668
#               temperature  precipitation
# 2000-12-31     2.956534      44.627392
# 2001-12-31     8.961623     804.956900
# 2002-12-31     0.720613     582.525174
# 2003-12-31     4.456005     628.544335
# 2004-12-31     0.500873     100.806668

python

# Create another dataset
data0 = pd.DataFrame(np.random.rand(5) * 10, index=pd.date_range('2000-01-01', periods=5, freq='Y'), columns=['temperature'])
data1 = pd.DataFrame(np.random.rand(5) * 1000, index=pd.date_range('2005-01-01', periods=5, freq='Y'), columns=['temperature'])

print(data0)
print(data1)

# Concatenate datasets by row (axis=0)
data = pd.concat([data0, data1])
print(data)

# Output (example):
#               temperature
# 2000-12-31     9.371567
# 2001-12-31     9.577373
# 2002-12-31     9.100010
# 2003-12-31     1.377707
# 2004-12-31     2.171535
#               temperature
# 2005-12-31   182.571988
# 2006-12-31   623.641092
# 2007-12-31   865.181408
# 2008-12-31   333.438249
# 2009-12-31   780.453570
#               temperature
# 2000-12-31     9.371567
# 2001-12-31     9.577373
# 2002-12-31     9.100010
# 2003-12-31     1.377707
# 2004-12-31     2.171535
# 2005-12-31   182.571988
# 2006-12-31   623.641092
# 2007-12-31   865.181408
# 2008-12-31   333.438249
# 2009-12-31   780.453570

python

# Create a more complex dataset
data0 = pd.DataFrame(np.random.rand(5, 3) * 10, index=pd.date_range('2000-01-01', periods=5, freq='Y'), columns=['temperature', 'pressure', 'humidity'])
data1 = pd.DataFrame(np.random.rand(5, 4) * 10, index=pd.date_range('2001-01-01', periods=5, freq='Y'), columns=['temperature', 'precipitation', 'humidity', 'wind'])

print(data0)
print(data1)

# The default concatenation method is 'outer', taking the union of both datasets.
data = pd.concat([data0, data1], axis=1)
print(data)

# Output (example):
#               temperature  pressure  humidity
# 2000-12-31     1.001251  2.545684  9.666708
# 2001-12-31     2.902656  9.816877  7.251775
# 2002-12-31     3.361788  8.067900  3.038989
# 2003-12-31     9.847927  4.671510  9.594601
# 2004-12-31     5.341185  3.827158  4.114981
#               temperature  precipitation  humidity      wind
# 2001-12-31     3.272974       4.891393  4.228730  1.531332
# 2002-12-31     4.177929       5.301486  5.563462  6.474081
# 2003-12-31     8.423616       6.799374  9.836578  9.643460
# 2004-12-31     7.839116       6.346595  2.614979  9.741067
# 2005-12-31     8.017644       7.170777  4.284372  0.935709
#               temperature  pressure  humidity  temperature  precipitation  \
# 2000-12-31     1.001251  2.545684  9.666708          NaN            NaN
# 2001-12-31     2.902656  9.816877  7.251775     3.272974       4.891393
# 2002-12-31     3.361788  8.067900  3.038989     4.177929       5.301486
# 2003-12-31     9.847927  4.671510  9.594601     8.423616       6.799374
# 2004-12-31     5.341185  3.827158  4.114981     7.839116       6.346595
# 2005-12-31          NaN       NaN       NaN     8.017644       7.170777
#
#               humidity      wind
# 2000-12-31       NaN       NaN
# 2001-12-31  4.228730  1.531332
# 2002-12-31  5.563462  6.474081
# 2003-12-31  9.836578  9.643460
# 2004-12-31  2.614979  9.741067
# 2005-12-31  4.284372  0.935709

We can note that when dealing with a more complex dataset, more settings are needed to achieve the desired result. The default concatenation method for concat is 'outer', taking the union of both datasets.

Additionally, for the dimension not being concatenated, if data at the corresponding position doesn’t exist, it’s filled with NaN (here, since we concatenate by column, values for corresponding row indices that don’t exist are filled with NaN).

We can also keep only the columns that exist in both datasets (intersection):

python

# Concatenation method is 'inner', taking the intersection of both datasets.
data = pd.concat([data0, data1], axis=1, join='inner')      # Keep only common rows.
print(data)

# Note: Since we are concatenating by column, the intersection applies to row indices.
# Let's try the effect of different axis concatenation in this example.
data = pd.concat([data0, data1], axis=0, join='inner')      # Keep only common columns.
print(data)

# Output (example):
# (axis=1, join='inner')
#               temperature  pressure  humidity  temperature  precipitation  \
# 2001-12-31     2.902656  9.816877  7.251775     3.272974       4.891393
# 2002-12-31     3.361788  8.067900  3.038989     4.177929       5.301486
# 2003-12-31     9.847927  4.671510  9.594601     8.423616       6.799374
# 2004-12-31     5.341185  3.827158  4.114981     7.839116       6.346595
#
#               humidity      wind
# 2001-12-31  4.228730  1.531332
# 2002-12-31  5.563462  6.474081
# 2003-12-31  9.836578  9.643460
# 2004-12-31  2.614979  9.741067

# (axis=0, join='inner')
#               temperature  humidity
# 2000-12-31     1.001251  9.666708
# 2001-12-31     2.902656  7.251775
# 2002-12-31     3.361788  3.038989
# 2003-12-31     9.847927  9.594601
# 2004-12-31     5.341185  4.114981
# 2001-12-31     3.272974  4.228730
# 2002-12-31     4.177929  5.563462
# 2003-12-31     8.423616  9.836578
# 2004-12-31     7.839116  2.614979
# 2005-12-31     8.017644  4.284372

Finally, in the last output above, duplicate row indices will cause trouble during indexing (try running data.loc['2001-12-31', :]).

If row indices aren’t significant, we can simply reset them:

python

data = pd.concat([data0, data1], axis=0, ignore_index=True)
print(data)

# Output (example):
#    temperature  pressure  humidity  precipitation      wind
# 0     1.001251  2.545684  9.666708            NaN       NaN
# 1     2.902656  9.816877  7.251775            NaN       NaN
# 2     3.361788  8.067900  3.038989            NaN       NaN
# 3     9.847927  4.671510  9.594601            NaN       NaN
# 4     5.341185  3.827158  4.114981            NaN       NaN
# 5     3.272974       NaN  4.228730       4.891393  1.531332
# 6     4.177929       NaN  5.563462       5.301486  6.474081
# 7     8.423616       NaN  9.836578       6.799374  9.643460
# 8     7.839116       NaN  2.614979       6.346595  9.741067
# 9     8.017644       NaN  4.284372       7.170777  0.935709

merge

The merge function is used to combine DataFrames based on one or more keys.

Unlike concat, which concatenates along an axis, merge is more akin to a relational database join operation.

It associates rows from two DataFrames based on specified keys, similar to the pivot functionality in Excel.

Main parameters include:

left: The first DataFrame.

right: The second DataFrame.

on: The join key(s), column name(s) existing in both DataFrames.

left_on: Join key(s) from the left DataFrame.

right_on: Join key(s) from the right DataFrame.

left_index: Whether to use the left DataFrame’s index as the join key.

right_index: Whether to use the right DataFrame’s index as the join key.

how: Join method. Options are 'inner' (inner join), 'outer' (outer join), 'left' (left join), 'right' (right join).

suffixes: Suffixes to add when join column names overlap.

Let’s demonstrate using the random data above:

python

left = data0.copy()
right = data1.copy()

df_merge = pd.merge(left, right, on='temperature', how='inner')
print(df_merge)

# Output:
# Empty DataFrame
# Columns: [temperature, pressure, humidity_x, precipitation, humidity_y, wind]
# Index: []

Unlike concat, which matches based on column/row names, merge requires identical values to join two datasets. Since our specified temperature random numbers have no duplicates, the result is empty.

Let’s modify it slightly:

python

right['temperature'] = left['temperature'] # Copy temperature column to ensure matches.

df_merge = pd.merge(left, right, on='temperature', how='inner')
print(df_merge)

# Output (example):
#    temperature  pressure  humidity_x  precipitation  humidity_y      wind
# 0     2.902656  9.816877    7.251775       4.891393    4.228730  1.531332
# 1     3.361788  8.067900    3.038989       5.301486    5.563462  6.474081
# 2     9.847927  4.671510    9.594601       6.799374    9.836578  9.643460
# 3     5.341185  3.827158    4.114981       6.346595    2.614979  9.741067

Of course, we can also try other join methods:

python

# Join method 'outer', taking the union.
df_merge = pd.merge(left, right, on='temperature', how='outer')
print(df_merge)

# Join method 'left', keeping only rows corresponding to keys in the left table.
df_merge = pd.merge(left, right, on='temperature', how='left')
print(df_merge)

# Join method 'right', keeping only rows corresponding to keys in the right table.
df_merge = pd.merge(left, right, on='temperature', how='right')
print(df_merge)

# Output (example - outer):
#    temperature  pressure  humidity_x  precipitation  humidity_y      wind
# 0     1.001251  2.545684    9.666708            NaN         NaN       NaN
# 1     2.902656  9.816877    7.251775       4.891393    4.228730  1.531332
# 2     3.361788  8.067900    3.038989       5.301486    5.563462  6.474081
# 3     9.847927  4.671510    9.594601       6.799374    9.836578  9.643460
# 4     5.341185  3.827158    4.114981       6.346595    2.614979  9.741067
# 5          NaN       NaN         NaN       7.170777    4.284372  0.935709
# ... (left and right similar pattern) ...

Or directly use index values (row labels) for joining:

python

# Merge tables based on row index.
# Briefly try the effect of suffixes.
df_merge = pd.merge(left, right, left_index=True, right_index=True, how='outer', suffixes=('_L', '_R'))
print(df_merge)

# Output (example):
#               temperature_L  pressure  humidity_L  temperature_R  precipitation  \
# 2000-12-31       1.001251  2.545684    9.666708            NaN            NaN
# 2001-12-31       2.902656  9.816877    7.251775       2.902656       4.891393
# 2002-12-31       3.361788  8.067900    3.038989       3.361788       5.301486
# 2003-12-31       9.847927  4.671510    9.594601       9.847927       6.799374
# 2004-12-31       5.341185  3.827158    4.114981       5.341185       6.346595
# 2005-12-31            NaN       NaN         NaN            NaN       7.170777
#
#               humidity_R      wind
# 2000-12-31         NaN       NaN
# 2001-12-31    4.228730  1.531332
# 2002-12-31    5.563462  6.474081
# 2003-12-31    9.836578  9.643460
# 2004-12-31    2.614979  9.741067
# 2005-12-31    4.284372  0.935709

In some scenarios, column names for corresponding data in two tables differ, or we need to join based on multiple columns. Then we need to specify different column names for indexing:

python

right['tas'] = left['temperature'] # Rename columns for demonstration.
right['rh'] = left['humidity']

df_merge = pd.merge(left, right, left_on=['temperature', 'humidity'], right_on=['tas', 'rh'], how='inner', suffixes=('_α', '_β'))
print(df_merge)

# Output (example):
#    temperature_α  pressure  humidity_α  temperature_β  precipitation  \
# 0       2.902656  9.816877    7.251775       2.902656       4.891393
# 1       3.361788  8.067900    3.038989       3.361788       5.301486
# 2       9.847927  4.671510    9.594601       9.847927       6.799374
# 3       5.341185  3.827158    4.114981       5.341185       6.346595
#
#    humidity_β      wind       tas        rh
# 0    4.228730  1.531332  2.902656  7.251775
# 1    5.563462  6.474081  3.361788  3.038989
# 2    9.836578  9.643460  9.847927  9.594601
# 3    2.614979  9.741067  5.341185  4.114981

join

This function is similar to a combination of the above two. It merges two DataFrame objects based on their index (row labels).

Its usage is akin to pd.merge(left, right, left_index=True, right_index=True, suffixes=('_L', '_R')).

Let’s directly see the effect of executing such statements:

python

left = data0.copy()
right = data1.copy()

df_join = left.join(right, how='outer', lsuffix='_L', rsuffix='_R')
print(df_join)

df_join = left.join(right, how='inner', lsuffix='_L', rsuffix='_R')
print(df_join)

df_join = left.join(right, how='left', lsuffix='_L', rsuffix='_R')
print(df_join)

df_join = left.join(right, how='right', lsuffix='_L', rsuffix='_R')
print(df_join)

# Output (example - outer):
#               temperature_L  pressure  humidity_L  temperature_R  precipitation  \
# 2000-12-31       1.001251  2.545684    9.666708            NaN            NaN
# 2001-12-31       2.902656  9.816877    7.251775       3.272974       4.891393
# 2002-12-31       3.361788  8.067900    3.038989       4.177929       5.301486
# 2003-12-31       9.847927  4.671510    9.594601       8.423616       6.799374
# 2004-12-31       5.341185  3.827158    4.114981       7.839116       6.346595
# 2005-12-31            NaN       NaN         NaN       8.017644       7.170777
#
#               humidity_R      wind
# 2000-12-31         NaN       NaN
# 2001-12-31    4.228730  1.531332
# 2002-12-31    5.563462  6.474081
# 2003-12-31    9.836578  9.643460
# 2004-12-31    2.614979  9.741067
# 2005-12-31    4.284372  0.935709
# ... (inner, left, right similar) ...

Postscript

The above covers the basic content of concatenating data using Pandas. As for its practical application, the most straightforward example might be batch reading meteorological data stored on a yearly basis and merging them into a complete time series for subsequent processing.

Easy Python

Pandas 05 | How to Use Pandas to Concatenate Data?

New Article

Related articles