# Merge, Join, Concat

## Merge, Join, Concat,Sort <a href="#mergejoin-concatsort" id="mergejoin-concatsort"></a>

主要会从以下几个方面和大家分享：

1. Merge
2. Join
3. Concat
4. 源码及GitHub地址

## 1. Merge <a href="#merge" id="merge"></a>

首先merge的操作非常类似sql里面的join，实现将两个Dataframe根据一些共有的列连接起来，当然，在实际场景中，这些共有列一般是Id，\
&#x20;连接方式也丰富多样，可以选择inner(默认)，left,right,outer 这几种模式，分别对应的是内连接，左连接，右连接

### 1.1 InnerMerge (内连接) <a href="#innermerge-nei-lian-jie" id="innermerge-nei-lian-jie"></a>

首先让我们简单的创建两个DF,分别为DataFrame1,DataFrame2,他们的公有列是key

```python
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
```

```python
dframe1 = DataFrame({'key':['X','Z','Y','Z','X','X'],'value_df1': np.arange(6)})
dframe1
```

|   | key | value\_df1 |
| - | --- | ---------- |
| 0 | X   | 0          |
| 1 | Z   | 1          |
| 2 | Y   | 2          |
| 3 | Z   | 3          |
| 4 | X   | 4          |
| 5 | X   | 5          |

```python

dframe2 = DataFrame({'key':['Q','Y','Z'],'value_df2':[1,2,3]})
dframe2
```

|   | key | value\_df2 |
| - | --- | ---------- |
| 0 | Q   | 1          |
| 1 | Y   | 2          |
| 2 | Z   | 3          |

我们现在可以简单地使用pd.merge(dframe1,dframe2)来实现Merge功能

```python
pd.merge(dframe1,dframe2)
```

|   | key | value\_df1 | value\_df2 |
| - | --- | ---------- | ---------- |
| 0 | Z   | 1          | 3          |
| 1 | Z   | 3          | 3          |
| 2 | Y   | 2          | 2          |

我们现在需要注意一点，X仅仅是存在于dframe1的key，在dframe2中不存在，因此大家可以发现，当我们调用pd.merge的时候，会自动默认为inner join，\
&#x20;我们再换一种方式写一下，大家就明白了：

```python
pd.merge(dframe1,dframe2,on='key',how='inner')
```

|   | key | value\_df1 | value\_df2 |
| - | --- | ---------- | ---------- |
| 0 | Z   | 1          | 3          |
| 1 | Z   | 3          | 3          |
| 2 | Y   | 2          |            |

### 1.2 LeftMerge (左连接) <a href="#leftmerge-zuo-lian-jie" id="leftmerge-zuo-lian-jie"></a>

```python
pd.merge(dframe1,dframe2,on='key',how='left')

```

|   | key | value\_df1 | value\_df2 |
| - | --- | ---------- | ---------- |
| 0 | X   | 0          | NaN        |
| 1 | Z   | 1          | 3.0        |
| 2 | Y   | 2          | 2.0        |
| 3 | Z   | 3          | 3.0        |
| 4 | X   | 4          | NaN        |
| 5 | X   | 5          | NaN        |

我们可以看到返回的是dframe1的所有key值对应的结果，如果在dframe2中不存在，显示为Nan空值

### 1.3 RightMerge (右连接) <a href="#rightmerge-you-lian-jie" id="rightmerge-you-lian-jie"></a>

右连接的原理和左连接正相反

```python
pd.merge(dframe1,dframe2,on='key',how='right')

```

|   | key | value\_df1 | value\_df2 |
| - | --- | ---------- | ---------- |
| 0 | Z   | 1.0        | 3          |
| 1 | Z   | 3.0        | 3          |
| 2 | Y   | 2.0        | 2          |
| 3 | Q   | NaN        | 1          |

这里Q只存在于drame2的key中

### 1.4 OuterMerge (全连接) <a href="#outermerge-quan-lian-jie" id="outermerge-quan-lian-jie"></a>

```python
pd.merge(dframe1,dframe2,on='key',how='outer')
```

|   | key | value\_df1 | value\_df2 |
| - | --- | ---------- | ---------- |
| 0 | X   | 0.0        | NaN        |
| 1 | X   | 4.0        | NaN        |
| 2 | X   | 5.0        | NaN        |
| 3 | Z   | 1.0        | 3.0        |
| 4 | Z   | 3.0        | 3.0        |
| 5 | Y   | 2.0        | 2.0        |
| 6 | Q   | NaN        | 1.0        |

### 1.5 MultipleKey Merge (基于多个key上的merge) <a href="#multiplekeymerge-ji-yu-duo-ge-key-shang-de-merge" id="multiplekeymerge-ji-yu-duo-ge-key-shang-de-merge"></a>

刚才我们都是仅仅实现的在一个key上的merge，当然我们也可以实现基于多个keys的merge

```python

df_left = DataFrame({'key1': ['SF', 'SF', 'LA'],
                  'key2': ['one', 'two', 'one'],
                  'left_data': [10,20,30]})
df_left
```

|   | key1 | key2 | left\_data |
| - | ---- | ---- | ---------- |
| 0 | SF   | one  | 10         |
| 1 | SF   | two  | 20         |
| 2 | LA   | one  | 30         |

```python

df_right = DataFrame({'key1': ['SF', 'SF', 'LA', 'LA'],
                   'key2': ['one', 'one', 'one', 'two'],
                   'right_data': [40,50,60,70]})
df_right
```

|   | key1 | key2 | right\_data |
| - | ---- | ---- | ----------- |
| 0 | SF   | one  | 40          |
| 1 | SF   | one  | 50          |
| 2 | LA   | one  | 60          |
| 3 | LA   | two  | 70          |

```python
pd.merge(df_left, df_right, on=['key1', 'key2'])
```

|   | key1 | key2 | left\_data | right\_data |
| - | ---- | ---- | ---------- | ----------- |
| 0 | SF   | one  | 10         | 40          |
| 1 | SF   | one  | 10         | 50          |
| 2 | LA   | one  | 30         | 60          |

```python
pd.merge(df_left, df_right, on=['key1', 'key2'],how='outer')
```

|   | key1 | key2 | left\_data | right\_data |
| - | ---- | ---- | ---------- | ----------- |
| 0 | SF   | one  | 10.0       | 40.0        |
| 1 | SF   | one  | 10.0       | 50.0        |
| 2 | SF   | two  | 20.0       | NaN         |
| 3 | LA   | one  | 30.0       | 60.0        |
| 4 | LA   | two  | NaN        | 70.0        |

这里还有一个地方非常有意思，大家可以发现现在df\_left,df\_right作为key的两列分别是key1和key2，它们的名字是相同的，刚刚我们是通过制定on=\[‘key1’, ‘key2’],那如果我们只指定一列会怎么样呢？

```python
pd.merge(df_left,df_right,on='key1')
```

|   | key1 | key2\_x | left\_data | key2\_y | right\_data |
| - | ---- | ------- | ---------- | ------- | ----------- |
| 0 | SF   | one     | 10         | one     | 40          |
| 1 | SF   | one     | 10         | one     | 50          |
| 2 | SF   | two     | 20         | one     | 40          |
| 3 | SF   | two     | 20         | one     | 50          |
| 4 | LA   | one     | 30         | one     | 60          |
| 5 | LA   | one     | 30         | two     | 70          |

大家可以看到pandas自动把key2这一列拆分成了key2\_x和key2\_y，都会显示在最后的merge结果里，如果我们想要给这两列重新命名，也是很容易的：

```python
pd.merge(df_left,df_right, on='key1',suffixes=('_lefty','_righty'))
```

|   | key1 | key2\_lefty | left\_data | key2\_righty | right\_data |
| - | ---- | ----------- | ---------- | ------------ | ----------- |
| 0 | SF   | one         | 10         | one          | 40          |
| 1 | SF   | one         | 10         | one          | 50          |
| 2 | SF   | two         | 20         | one          | 40          |
| 3 | SF   | two         | 20         | one          | 50          |
| 4 | LA   | one         | 30         | one          | 60          |
| 5 | LA   | one         | 30         | two          | 70          |

像这样，可以通过suffixes参数来指定拆分的列的名字。

### 1.6 Merge on Index (基于index上的merge) <a href="#mergeonindex-ji-yu-index-shang-de-merge" id="mergeonindex-ji-yu-index-shang-de-merge"></a>

```python
df_left = DataFrame({'key': ['X','Y','Z','X','Y'],'data': range(5)})
df_right = DataFrame({'group_data': [10, 20]}, index=['X', 'Y'])
```

```python
df_left
```

|   | key | data |
| - | --- | ---- |
| 0 | X   | 0    |
| 1 | Y   | 1    |
| 2 | Z   | 2    |
| 3 | X   | 3    |
| 4 | Y   | 4    |

```
df_right
```

|   | group\_data |
| - | ----------- |
| X | 10          |
| Y | 20          |

现在想要实现两个Dataframe的merge，但是条件是通过df\_left的Key和df\_right的Index

```python
pd.merge(df_left,df_right,left_on='key',right_index=True)
```

|   | key | data | group\_data |
| - | --- | ---- | ----------- |
| 0 | X   | 0    | 10          |
| 3 | X   | 3    | 10          |
| 1 | Y   | 1    | 20          |
| 4 | Y   | 4    | 20          |

这样我们也可以得到结果。

```python
pd.merge(df_left,df_right,left_on='key',right_index=True,how='outer')
```

|   | key | data | group\_data |
| - | --- | ---- | ----------- |
| 0 | X   | 0    | 10.0        |
| 3 | X   | 3    | 10.0        |
| 1 | Y   | 1    | 20.0        |
| 4 | Y   | 4    | 20.0        |
| 2 | Z   | 2    | NaN         |

其他的merge方式就类似啦，这里就不一一说了，只是举一个outer join的例子

```python
pd.merge(df_left,df_right,left_on='key',right_index=True,how='outer')
```

|   | key | data | group\_data |
| - | --- | ---- | ----------- |
| 0 | X   | 0    | 10.0        |
| 3 | X   | 3    | 10.0        |
| 1 | Y   | 1    | 20.0        |
| 4 | Y   | 4    | 20.0        |
| 2 | Z   | 2    | NaN         |

我们可以尝试其他的merge，比如如果一个df的index是多层嵌套的情况

```python
df_left_hr = DataFrame({'key1': ['SF','SF','SF','LA','LA'],
                   'key2': [10, 20, 30, 20, 30],
                   'data_set': np.arange(5.)})
df_right_hr = DataFrame(np.arange(10).reshape((5, 2)),
                   index=[['LA','LA','SF','SF','SF'],
                          [20, 10, 10, 10, 20]],
                   columns=['col_1', 'col_2'])
```

```
df_left_hr
```

|   | key1 | key2 | data\_set |
| - | ---- | ---- | --------- |
| 0 | SF   | 10   | 0.0       |
| 1 | SF   | 20   | 1.0       |
| 2 | SF   | 30   | 2.0       |
| 3 | LA   | 20   | 3.0       |
| 4 | LA   | 30   | 4.0       |

```
df_right_hr
```

|    |    | col\_1 | col\_2 |
| -- | -- | ------ | ------ |
| LA | 20 | 0      | 1      |
| 10 | 2  | 3      |        |
| SF | 10 | 4      | 5      |
| 10 | 6  | 7      |        |
| 20 | 8  | 9      |        |

现在我们穿建了两个Dataframe 分别是df\_left\_hr和df\_right\_hr（Index两层），如果我们想通过使用df\_left\_hr的key1，key2 及df\_right\_hr的Index作为merge 的列，也是没有问题的

```python
pd.merge(df_left_hr,df_right_hr,left_on=['key1','key2'],right_index=True)
```

|   | key1 | key2 | data\_set | col\_1 | col\_2 |
| - | ---- | ---- | --------- | ------ | ------ |
| 0 | SF   | 10   | 0.0       | 4      | 5      |
| 0 | SF   | 10   | 0.0       | 6      | 7      |
| 1 | SF   | 20   | 1.0       | 8      | 9      |
| 3 | LA   | 20   | 3.0       | 0      | 1      |

基本到这里，我已经和大家分享了基础的Merge有关的所有操作，如果你平时生活工作中经常使用Excel执行类似操作的话，可以学习一下Merge哈，它会大幅度\
&#x20;减轻你的工作强度的！

## 2.Join <a href="#join" id="join"></a>

现在我们可以接着来看join相关的操作，先让我们看一个小例子

```python
left = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 
                    'B': ['B0', 'B1', 'B2', 'B3']}, 
                    index = ['K0', 'K1', 'K2', 'K3']) 
  
right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'], 
                      'D': ['D0', 'D1', 'D2', 'D3']}, 
                      index = ['K0', 'K1', 'K2', 'K3']) 
```

```
left
```

|    | A  | B  |
| -- | -- | -- |
| K0 | A0 | B0 |
| K1 | A1 | B1 |
| K2 | A2 | B2 |
| K3 | A3 | B3 |

```
right
```

|    | C  | D  |
| -- | -- | -- |
| K0 | C0 | D0 |
| K1 | C1 | D1 |
| K2 | C2 | D2 |
| K3 | C3 | D3 |

```python
left.join(right)
```

|    | A  | B  | C  | D  |
| -- | -- | -- | -- | -- |
| K0 | A0 | B0 | C0 | D0 |
| K1 | A1 | B1 | C1 | D1 |
| K2 | A2 | B2 | C2 | D2 |
| K3 | A3 | B3 | C3 | D3 |

其实通过这一个小例子大家也就明白了，join无非就是合并，默认是横向，还有一个点需要注意的是，我们其实可以通过join实现和merge一样的效果，但是为了\
&#x20;避免混淆，我不会多举其他的例子了，因为我个人认为一般情况下还是用merge函数好一些

## 3. Concat <a href="#concat" id="concat"></a>

为了更加全面彻底地了解Concat函数，大家可以先从一维的Numpy Array开始，首先让我们简单的创建一个矩阵：

```python

arr1 = np.arange(9).reshape((3,3))
arr1
```

```
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
```

让我们通过concatenate函数进行横向拼接

```python
np.concatenate([arr1,arr1],axis=1)
```

```python
array([[0, 1, 2, 0, 1, 2],
       [3, 4, 5, 3, 4, 5],
       [6, 7, 8, 6, 7, 8]])
```

再让我们进行纵向拼接：

```python
np.concatenate([arr1,arr1],axis=0)
```

```python
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8],
       [0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
```

有了基础的印象之后，现在让我们看看在pandas中是如何操作的：

```python
ser1 =  Series([0,1,2],index=['T','U','V'])
ser2 = Series([3,4],index=['X','Y'])

pd.concat([ser1,ser2])
```

```python
T    0
U    1
V    2
X    3
Y    4
dtype: int64
```

在上面的例子中，我们分别创建了两个没有重复Index的Series,然后用concat默认的把它们合并在一起，这时生成的依然是Series类型，如果我们把axis换成1，那生成的就是Dataframe,像下面一样

```python
pd.concat([ser1,ser2],axis=1,sort =True)  
```

|   | 0   | 1   |
| - | --- | --- |
| T | 0.0 | NaN |
| U | 1.0 | NaN |
| V | 2.0 | NaN |
| X | NaN | 3.0 |
| Y | NaN | 4.0 |

我们还可以指定在哪些index上进行concat:

```python
pd.concat([ser1,ser2],axis=1,join_axes=[['U','V','Y']])
```

|   | 0   | 1   |
| - | --- | --- |
| U | 1.0 | NaN |
| V | 2.0 | NaN |
| Y | NaN | 4.0 |

也可以给不同组的index加一层标签

```python
pd.concat([ser1,ser2],keys=['cat1','cat2'])
```

```python
cat1  T    0
      U    1
      V    2
cat2  X    3
      Y    4
dtype: int64
```

如果把axis换成是1，那么keys就会变成column的名字：

```python
pd.concat([ser1,ser2],axis=1,keys=['cat1','cat2'],sort=True)
```

|   | cat1 | cat2 |
| - | ---- | ---- |
| T | 0.0  | NaN  |
| U | 1.0  | NaN  |
| V | 2.0  | NaN  |
| X | NaN  | 3.0  |
| Y | NaN  | 4.0  |

如果是两个现成的dataframe直接进行concat也是一样：

```python
dframe1 = DataFrame(np.random.randn(4,3), columns=['X', 'Y', 'Z'])
dframe2 = DataFrame(np.random.randn(3, 3), columns=['Y', 'Q', 'X'])
```

```
dframe1
```

|   | X         | Y         | Z         |
| - | --------- | --------- | --------- |
| 0 | 1.119976  | -0.853960 | 0.027451  |
| 1 | -0.536831 | 0.982092  | -0.157650 |
| 2 | -0.219322 | -1.489809 | 1.607735  |
| 3 | 0.767249  | -1.661912 | 0.038837  |

```
dframe2
```

|   | Y         | Q        | X         |
| - | --------- | -------- | --------- |
| 0 | -0.035560 | 0.875282 | -1.630508 |
| 1 | -0.439484 | 0.096247 | 1.335693  |
| 2 | 0.746299  | 0.568684 | 1.197015  |

```python
pd.concat([dframe1,dframe2],sort=True)
```

|   | Q        | X         | Y         | Z         |
| - | -------- | --------- | --------- | --------- |
| 0 | NaN      | 1.119976  | -0.853960 | 0.027451  |
| 1 | NaN      | -0.536831 | 0.982092  | -0.157650 |
| 2 | NaN      | -0.219322 | -1.489809 | 1.607735  |
| 3 | NaN      | 0.767249  | -1.661912 | 0.038837  |
| 0 | 0.875282 | -1.630508 | -0.035560 | NaN       |
| 1 | 0.096247 | 1.335693  | -0.439484 | NaN       |
| 2 | 0.568684 | 1.197015  | 0.746299  | NaN       |

## 4. 总结 <a href="#yuan-ma-ji-github-di-zhi" id="yuan-ma-ji-github-di-zhi"></a>

Github仓库地址： [https://github.com/yaozeliang/pandas\_shar](https://github.com/yaozeliang/pandas_share)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://zeliang-yao.gitbook.io/my-note-zeliang-yao/useful/pandas/merge-join-concat.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
