
# Useful tricks

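1. Customizing pandas options and settings
2. Building test data with the pandas testing module
3. Making good use of accessor methods
4. Building a DatetimeIndex by combining other columns
5. Using categorical data to save time and space
6. Neat mappings with a dictionary
7. Compressing pandas objects
8. Miscellaneous
9. Summary (source code and GitHub address)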

## 1. Customizing pandas options and settings <a href="#zi-ding-yi-pandas-xuan-xiang-she-zhi" id="zi-ding-yi-pandas-xuan-xiang-she-zhi"></a>

Something you may not know: pandas has a function pd.set\_option() that lets us change some of pandas' core default settings to suit our own needs. Before we start, as usual, let's import numpy and pandas:

```python
import numpy as np
import pandas as pd
f'Using {pd.__name__}, Version {pd.__version__}'
```

Now let's write a start function that applies our custom pandas settings:

```python
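def start():
    options = {
        'display': {
            'max_columns': None,         # show all columns
            'max_colwidth': 25,          # truncate long cell contents
            'expand_frame_repr': False,  # do not wrap wide frames across lines
            'max_rows': 14,              # truncate long frames to 14 rows
            'max_seq_items': 50,         # max items shown for long sequences
            'precision': 4,              # decimal places shown for floats
            'show_dimensions': False     # hide the "[rows x columns]" footer
        },
        'mode': {
            'chained_assignment': None   # silence the chained-assignment warning
        }
    }

    for category, option in options.items():
        for op, value in option.items():
            pd.set_option(f'{category}.{op}', value)  # e.g. 'display.max_rows'

if __name__ == '__main__':
    start()
    del start  # clean up the namespace once the options are set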
```

As you can see, at the end of the function we call pandas' set\_option method, which replaces the default pandas settings with our own values. Let's test it:

```python
pd.get_option('display.max_rows')
```

```
14
```

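max\_rows has indeed been replaced by our value of 14. Now for a real example: we will experiment with a public dataset of abalone measurements, taken from the UCI Machine Learning Repository: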

```python
url = ('https://archive.ics.uci.edu/ml/'
       'machine-learning-databases/abalone/abalone.data')
cols = ['sex', 'length', 'diam', 'height', 'weight', 'rings']
abalone = pd.read_csv(url, usecols=[0, 1, 2, 3, 4, 8], names=cols)
abalone
```

|      | sex | length | diam  | height | weight | rings |
| ---- | --- | ------ | ----- | ------ | ------ | ----- |
| 0    | M   | 0.455  | 0.365 | 0.095  | 0.5140 | 15    |
| 1    | M   | 0.350  | 0.265 | 0.090  | 0.2255 | 7     |
| 2    | F   | 0.530  | 0.420 | 0.135  | 0.6770 | 9     |
| 3    | M   | 0.440  | 0.365 | 0.125  | 0.5160 | 10    |
| 4    | I   | 0.330  | 0.255 | 0.080  | 0.2050 | 7     |
| 5    | I   | 0.425  | 0.300 | 0.095  | 0.3515 | 8     |
| 6    | F   | 0.530  | 0.415 | 0.150  | 0.7775 | 20    |
| ...  | ... | ...    | ...   | ...    | ...    | ...   |
| 4170 | M   | 0.550  | 0.430 | 0.130  | 0.8395 | 10    |
| 4171 | M   | 0.560  | 0.430 | 0.155  | 0.8675 | 8     |
| 4172 | F   | 0.565  | 0.450 | 0.165  | 0.8870 | 11    |
| 4173 | M   | 0.590  | 0.440 | 0.135  | 0.9660 | 10    |
| 4174 | M   | 0.600  | 0.475 | 0.205  | 1.1760 | 9     |
| 4175 | F   | 0.625  | 0.485 | 0.150  | 1.0945 | 10    |
| 4176 | M   | 0.710  | 0.555 | 0.195  | 1.9485 | 12    |

As we can see, the output is truncated to 14 rows and floats are shown with 4 decimal places, matching the precision=4 we just set.
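
If the custom settings should only apply to part of a session, a context manager is an alternative to calling set\_option globally. The following is a minimal sketch using pd.option\_context, which restores the previous values on exit; the option values 5 and 2 are arbitrary choices for illustration.

```python
import pandas as pd

# Remember the current global value so we can compare afterwards
before = pd.get_option('display.max_rows')

# Options set here are only active inside the with-block
with pd.option_context('display.max_rows', 5, 'display.precision', 2):
    inside = pd.get_option('display.max_rows')

# On exit, the previous value is restored automatically
after = pd.get_option('display.max_rows')
print(before, inside, after)
```

This keeps one-off display tweaks from leaking into the rest of a notebook.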

## 2. Building test data with the testing module <a href="#shi-yong-pandas-zhong-testing-mo-kuai-gou-jian-ce-shi-shu-ju" id="shi-yong-pandas-zhong-testing-mo-kuai-gou-jian-ce-shi-shu-ju"></a>

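With the helpers in pandas.util.testing, a few lines of code are enough to build simple test data. For example, let's build a DataFrame with a datetime index at monthly frequency (note that pandas.util.testing has been deprecated since pandas 1.0, so the code in this section assumes an older release):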

```python
import pandas.util.testing as tm
tm.N, tm.K = 15, 3         

import numpy as np
np.random.seed(444)

tm.makeTimeDataFrame(freq='M').head() 

```

|            | A       | B       | C       |
| ---------- | ------- | ------- | ------- |
| 2000-01-31 | 0.3574  | -0.8804 | 0.2669  |
| 2000-02-29 | 0.3775  | 0.1526  | -0.4803 |
| 2000-03-31 | 1.3823  | 0.2503  | 0.3008  |
| 2000-04-30 | 1.1755  | 0.0785  | -0.1791 |
| 2000-05-31 | -0.9393 | -0.9039 | 1.1837  |

Generate a frame of arbitrary random data:

```python
tm.makeDataFrame().head()
```

|            | A       | B       | C       |
| ---------- | ------- | ------- | ------- |
| nTLGGTiRHF | -0.6228 | 0.6459  | 0.1251  |
| WPBRn9jtsR | -0.3187 | -0.8091 | 1.1501  |
| 7B3wWfvuDA | -1.9872 | -1.0795 | 0.2987  |
| yJ0BTjehH1 | 0.8802  | 0.7403  | -1.2154 |
| 0luaYUYvy1 | -0.9320 | 1.2912  | -0.2907 |

Roughly 30 kinds of random data can be generated this way; feel free to try them out:

```python
[i for i in dir(tm) if i.startswith('make')]
```

```python
['makeBoolIndex',
 'makeCategoricalIndex',
 'makeCustomDataframe',
 'makeCustomIndex',
 'makeDataFrame',
 'makeDateIndex',
 'makeFloatIndex',
 'makeFloatSeries',
 'makeIntIndex',
 'makeIntervalIndex',
 'makeMissingCustomDataframe',
 'makeMissingDataframe',
 'makeMixedDataFrame',
 'makeMultiIndex',
 'makeObjectSeries',
 'makePanel',
 'makePeriodFrame',
 'makePeriodIndex',
 'makePeriodPanel',
 'makePeriodSeries',
 'makeRangeIndex',
 'makeStringIndex',
 'makeStringSeries',
 'makeTimeDataFrame',
 'makeTimeSeries',
 'makeTimedeltaIndex',
 'makeUIntIndex',
 'makeUnicodeIndex']
```

So whenever we need to test something, it is easy to build matching fake data.
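
Because the make\* helpers may not be available in newer pandas releases, similar test data can also be built by hand. The sketch below is one possible stand-in, not the pandas implementation: make\_time\_frame is a hypothetical helper name, and the start date and daily frequency are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

def make_time_frame(n_rows=15, n_cols=3, seed=444):
    """Return a DataFrame of standard-normal values with a daily DatetimeIndex."""
    rng = np.random.default_rng(seed)          # seeded generator for reproducibility
    index = pd.date_range('2000-01-01', periods=n_rows, freq='D')
    columns = list('ABCDEFGHIJ')[:n_cols]      # column labels A, B, C, ...
    values = rng.standard_normal((n_rows, n_cols))
    return pd.DataFrame(values, index=index, columns=columns)

frame = make_time_frame()

# pandas.testing is the public home of the assertion helpers;
# the same seed must reproduce exactly the same frame
pd.testing.assert_frame_equal(frame, make_time_frame())
print(frame.head())
```

Seeding the generator inside the helper keeps each call reproducible without touching NumPy's global random state.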

## 3. Making good use of accessor methods <a href="#qiao-yong-accessor-jie-kou-fang-fa" id="qiao-yong-accessor-jie-kou-fang-fa"></a>

An accessor is similar to a getter and setter. Python does not encourage getter and setter methods, but the comparison helps build intuition. A pandas Series has three accessors:

```python
pd.Series._accessors
```

```python
{'cat', 'dt', 'str'}
```

* .cat for categorical data,
* .str for string (object) data,
* .dt for datetime-like data.

Let's start with .str. Suppose we have some raw city/state/ZIP code data as a column of a DataFrame:

```python
addr = pd.Series([
    'Washington, D.C. 20003',
    'Brooklyn, NY 11211-1755',
    'Omaha, NE 68154',
    'Pittsburgh, PA 15211'
])
```

```python
addr.str.upper()  
```

```python
0     WASHINGTON, D.C. 20003
1    BROOKLYN, NY 11211-1755
2            OMAHA, NE 68154
3       PITTSBURGH, PA 15211
dtype: object
```

```python
addr.str.count(r'\d')  
```

```python
0    5
1    9
2    5
3    5
dtype: int64
```

If we want to split each row into city, state, and ZIP code, we can use a regular expression:

```python
regex = (r'(?P<city>[A-Za-z ]+), '      
         r'(?P<state>[A-Z]{2}) '      
         r'(?P<zip>\d{5}(?:-\d{4})?)')  

# regex=False removes literal dots; read as a regex, '.' would match any character
addr.str.replace('.', '', regex=False).str.extract(regex)
```

|   | city       | state | zip        |
| - | ---------- | ----- | ---------- |
| 0 | Washington | DC    | 20003      |
| 1 | Brooklyn   | NY    | 11211-1755 |
| 2 | Omaha      | NE    | 68154      |
| 3 | Pittsburgh | PA    | 15211      |

The second accessor, .dt, is for datetime-like data. Technically it belongs to pandas' DatetimeIndex; when called on a Series, the values are first converted to a DatetimeIndex:

```python
daterng = pd.Series(pd.date_range('2018', periods=9, freq='Q'))  
daterng
```

```python
0   2018-03-31
1   2018-06-30
2   2018-09-30
3   2018-12-31
4   2019-03-31
5   2019-06-30
6   2019-09-30
7   2019-12-31
8   2020-03-31
dtype: datetime64[ns]
```

```python
daterng.dt.day_name()
```

```python
0    Saturday
1    Saturday
2      Sunday
3      Monday
4      Sunday
5      Sunday
6      Monday
7     Tuesday
8     Tuesday
dtype: object
```

```python
daterng[daterng.dt.quarter > 2]  
```

```python
2   2018-09-30
3   2018-12-31
6   2019-09-30
7   2019-12-31
dtype: datetime64[ns]
```

```python
daterng[daterng.dt.is_year_end]  
```

```python
3   2018-12-31
7   2019-12-31
dtype: datetime64[ns]
```
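
Several .dt attributes can also be combined to derive new columns from one datetime Series. A small sketch, with sample dates chosen only for illustration:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2018-03-31', '2018-12-31', '2019-06-30']))

parts = pd.DataFrame({
    'year': dates.dt.year,                 # calendar year
    'quarter': dates.dt.quarter,           # quarter number, 1 to 4
    'label': dates.dt.strftime('%Y-%m'),   # formatted as text
})
print(parts)
```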

Finally, the .cat accessor is covered in tip 5.

## 4. Building a DatetimeIndex by combining other columns <a href="#he-bing-qi-ta-lie-pin-jie-datetimeindex" id="he-bing-qi-ta-lie-pin-jie-datetimeindex"></a>

First, let's build a DataFrame that contains date components:

```python
from itertools import product
datecols = ['year', 'month', 'day']

df = pd.DataFrame(list(product([2017, 2016], [1, 2], [1, 2, 3])),
                  columns=datecols)
df['data'] = np.random.randn(len(df))
df
```

|    | year | month | day | data    |
| -- | ---- | ----- | --- | ------- |
| 0  | 2017 | 1     | 1   | -0.0767 |
| 1  | 2017 | 1     | 2   | -1.2798 |
| 2  | 2017 | 1     | 3   | 0.4032  |
| 3  | 2017 | 2     | 1   | 1.2377  |
| 4  | 2017 | 2     | 2   | -0.2060 |
| 5  | 2017 | 2     | 3   | 0.6187  |
| 6  | 2016 | 1     | 1   | 2.3786  |
| 7  | 2016 | 1     | 2   | -0.4730 |
| 8  | 2016 | 1     | 3   | -2.1505 |
| 9  | 2016 | 2     | 1   | -0.6340 |
| 10 | 2016 | 2     | 2   | 0.7964  |
| 11 | 2016 | 2     | 3   | 0.0005  |

year, month, and day are three separate columns. To combine them into complete dates and use those as the index of df, we can do this:

```python
df.index = pd.to_datetime(df[datecols])
df.head()
```

|            | year | month | day | data    |
| ---------- | ---- | ----- | --- | ------- |
| 2017-01-01 | 2017 | 1     | 1   | -0.0767 |
| 2017-01-02 | 2017 | 1     | 2   | -1.2798 |
| 2017-01-03 | 2017 | 1     | 3   | 0.4032  |
| 2017-02-01 | 2017 | 2     | 1   | 1.2377  |
| 2017-02-02 | 2017 | 2     | 2   | -0.2060 |

We can then drop the now-redundant columns and squeeze the DataFrame into a Series:

```python
df = df.drop(datecols, axis=1).squeeze()
df.head()
```

```python
2017-01-01   -0.0767
2017-01-02   -1.2798
2017-01-03    0.4032
2017-02-01    1.2377
2017-02-02   -0.2060
Name: data, dtype: float64
```

```python
type(df)
```

```
pandas.core.series.Series
```

```python
str(df.index.dtype)  # Index.dtype_str is no longer available in recent pandas versions
```

```
'datetime64[ns]'
```
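
pd.to\_datetime assembles dates from a DataFrame when it finds columns named year, month, and day. If the source columns have other names, one option is to rename a copy of them first. A minimal sketch, where the frame raw and its column names yr, mo, and dy are made up for illustration:

```python
import pandas as pd

raw = pd.DataFrame({'yr': [2017, 2017], 'mo': [1, 2],
                    'dy': [15, 28], 'value': [1.0, 2.0]})

# to_datetime expects the component columns to be called year/month/day
parts = raw[['yr', 'mo', 'dy']].rename(
    columns={'yr': 'year', 'mo': 'month', 'dy': 'day'})

series = raw['value'].copy()
series.index = pd.to_datetime(parts)   # same index assignment as used in this section
print(series)
```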

## 5. Using categorical data to save time and space <a href="#shi-yong-fen-lei-shu-ju-categoricaldata-jie-sheng-shi-jian-he-kong-jian" id="shi-yong-fen-lei-shu-ju-categoricaldata-jie-sheng-shi-jian-he-kong-jian"></a>

In tip 3 we mentioned accessors; now let's look at the last one, .cat.

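The Categorical dtype in pandas is very powerful: converting to it can reduce the memory a variable occupies and speed up computation. I will cover hands-on pandas speed-ups in the next installment; for now, let's look at a small example: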

```python
colors = pd.Series([
    'periwinkle',
    'mint green',
    'burnt orange',
    'periwinkle',
    'burnt orange',
    'rose',
    'rose',
    'mint green',
    'rose',
    'navy'
])

import sys
colors.apply(sys.getsizeof)

```

```python
0    59
1    59
2    61
3    59
4    61
5    53
6    53
7    59
8    53
9    53
dtype: int64
```

We first created a Series filled with color names, then checked how much memory the string at each position occupies.

> **Note that we use sys.getsizeof() to measure memory usage here. Every string carries a fixed overhead: even sys.getsizeof('') returns 49 bytes.**

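Next, we want to represent each color with a number that takes less memory (very common in machine learning), which reduces the memory footprint. First, let's create a mapper dictionary that assigns a number to each color: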

```python
mapper = {v: k for k, v in enumerate(colors.unique())}
mapper
```

```python
{'periwinkle': 0, 'mint green': 1, 'burnt orange': 2, 'rose': 3, 'navy': 4}
```

Then we convert the colors Series to integers:

```python
as_int = colors.map(mapper)
as_int
```

```python
0    0
1    1
2    2
3    0
4    2
5    3
6    3
7    1
8    3
9    4
dtype: int64
```
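
pandas also ships a helper that produces this kind of integer encoding in one step. As a sketch of the same idea (not what the mapper code does internally), pd.factorize returns the codes together with the unique values in order of first appearance:

```python
import pandas as pd

colors = pd.Series(['periwinkle', 'mint green', 'burnt orange', 'periwinkle',
                    'burnt orange', 'rose', 'rose', 'mint green', 'rose', 'navy'])

# codes: one integer per row; uniques: the distinct values, first-seen order
codes, uniques = pd.factorize(colors)
print(list(codes))
print(list(uniques))
```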

Let's look at the memory usage again:

```python
as_int.apply(sys.getsizeof)
```

```python
0    24
1    28
2    28
3    24
4    28
5    28
6    28
7    28
8    28
9    28
dtype: int64
```

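The memory used is now roughly half of what it was before. What we just did is essentially a simulation of how Categorical data works under the hood. Now let's use it directly: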

```python
colors.memory_usage(index=False, deep=True)

# Out: 650
```

```python
colors.astype('category').memory_usage(index=False, deep=True)

# Out: 495
```

The savings may not look very large, right? That is because this toy data is far from a realistic scenario. Let's simply make the data 10 times larger and look again:

```python
manycolors = colors.repeat(10)
len(manycolors) / manycolors.nunique()  

# Out: 20.0
```

```python
f"Not using category : { manycolors.memory_usage(index=False, deep=True)}"

# 'Not using category : 6500'
```

```python
f"Using category : { manycolors.astype('category').memory_usage(index=False, deep=True)}"

# 'Using category : 585'
```
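
To make the trend easier to see, the comparison can be wrapped in a small helper and run at several sizes. This is a sketch only: memory\_ratio is a hypothetical helper, and the exact numbers depend on the Python and pandas versions.

```python
import pandas as pd

def memory_ratio(s):
    """Deep memory of s as stored, divided by its memory as category dtype."""
    plain = s.memory_usage(index=False, deep=True)
    as_cat = s.astype('category').memory_usage(index=False, deep=True)
    return plain / as_cat

base = pd.Series(['periwinkle', 'mint green', 'burnt orange', 'rose', 'navy'])

# Repeat every value n times: more rows, same five categories
ratios = {n: memory_ratio(base.repeat(n)) for n in (1, 10, 100)}
print(ratios)
```

The more often each value repeats, the more the fixed cost of storing the category labels is amortized.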

This time the difference in memory usage is obvious. Now let's use .cat to simplify the work we just did:

```python
new_colors = colors.astype('category')
new_colors
```

```python
0      periwinkle
1      mint green
2    burnt orange
3      periwinkle
4    burnt orange
5            rose
6            rose
7      mint green
8            rose
9            navy
dtype: category
Categories (5, object): [burnt orange, mint green, navy, periwinkle, rose]
```

```python
new_colors.cat.categories   
```

```python
Index(['burnt orange', 'mint green', 'navy', 'periwinkle', 'rose'], dtype='object')
```

Now let's look at the integer codes that represent the colors:

```python
new_colors.cat.codes
```

```python
0    3
1    1
2    0
3    3
4    0
5    4
6    4
7    1
8    4
9    2
dtype: int8
```

If we are not happy with the order, we can reorder the categories:

```python
new_colors.cat.reorder_categories(mapper).cat.codes
```

```python
0    0
1    1
2    2
3    0
4    2
5    3
6    3
7    1
8    3
9    4
dtype: int8
```
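
When the category order carries meaning, the categories can also be declared as ordered, which enables sorting and comparisons by that order. A brief sketch with made-up size labels (this data is not from the article):

```python
import pandas as pd

size_type = pd.CategoricalDtype(['small', 'medium', 'large'], ordered=True)
sizes = pd.Series(['medium', 'small', 'large', 'small']).astype(size_type)

print(sizes.cat.codes.tolist())      # codes follow the declared order
print(sizes.sort_values().tolist())  # sorted by category order, not alphabetically
print((sizes > 'small').tolist())    # comparisons use the same order
```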

As for the other .cat methods, we can again list them with dir:

```python
[i for i in dir(new_colors.cat) if not i.startswith('_')]
```

```python
['add_categories',
 'as_ordered',
 'as_unordered',
 'categories',
 'codes',
 'ordered',
 'remove_categories',
 'remove_unused_categories',
 'rename_categories',
 'reorder_categories',
 'set_categories']
```

> Categorical data is generally less flexible. For example, we cannot directly assign a new color to new\_colors; we first have to add it with .add\_categories.

```python
ccolors.iloc[5] = 'a new color'
```

```
---------------------------------------------------------------------------

NameError                                 Traceback (most recent call last)

<ipython-input-36-1766a795336d> in <module>()
----> 1 ccolors.iloc[5] = 'a new color'


NameError: name 'ccolors' is not defined
```

```python
new_colors = new_colors.cat.add_categories(['a new color'])  # register the category first
```

```python
new_colors.iloc[5] = 'a new color'  # now the assignment succeeds
```

```python
new_colors.values
```
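
Put together, the failure and the fix can be reproduced in a self-contained sketch. The color names here are placeholders, and both TypeError and ValueError are caught on the assumption that the exception type raised for an unknown category may differ between pandas versions.

```python
import pandas as pd

shades = pd.Series(['rose', 'navy', 'rose']).astype('category')

try:
    shades.iloc[0] = 'teal'          # 'teal' is not a known category
    rejected = False
except (TypeError, ValueError):
    rejected = True

shades = shades.cat.add_categories(['teal'])  # register the new category
shades.iloc[0] = 'teal'                        # now the assignment works
print(rejected, shades.tolist())
```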

## 6. Neat mappings with a dictionary <a href="#li-yong-mapping-qiao-miao-shi-xian-ying-she" id="li-yong-mapping-qiao-miao-shi-xian-ying-she"></a>

Suppose we have a Series of countries and a dictionary that maps each continent to its countries:

```python
countries = pd.Series([
    'United States',
    'Canada',
    'Mexico',
    'Belgium',
    'United Kingdom',
    'Thailand'
])

groups = {
    'North America': ('United States', 'Canada', 'Mexico', 'Greenland'),
    'Europe': ('France', 'Germany', 'United Kingdom', 'Belgium')
}
```

We can implement a simple mapping with the following function:

```python
from typing import Any

def membership_map(s: pd.Series, groups: dict,
                   fillvalue: Any=-1) -> pd.Series:
    
    groups = {x: k for k, v in groups.items() for x in v}
    return s.map(groups).fillna(fillvalue)
```

```python
membership_map(countries, groups, fillvalue='other')
```

Simple, right? Now let's look at the key line, groups = {x: k for k, v in groups.items() for x in v}. It is a dict comprehension, which I mentioned before:

```python
test = dict(enumerate(('ab', 'cd', 'xyz')))
{x: k for k, v in test.items() for x in v}
```
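
For reference, here is the whole tip as one self-contained sketch, including the results of the two calls. The internal name lookup is used here instead of reassigning groups, purely for readability.

```python
from typing import Any

import pandas as pd

def membership_map(s: pd.Series, groups: dict, fillvalue: Any = -1) -> pd.Series:
    # Invert {group: members} into {member: group}
    lookup = {x: k for k, v in groups.items() for x in v}
    # Members missing from the lookup become NaN, then fillvalue
    return s.map(lookup).fillna(fillvalue)

countries = pd.Series(['United States', 'Canada', 'Mexico',
                       'Belgium', 'United Kingdom', 'Thailand'])
groups = {
    'North America': ('United States', 'Canada', 'Mexico', 'Greenland'),
    'Europe': ('France', 'Germany', 'United Kingdom', 'Belgium'),
}

result = membership_map(countries, groups, fillvalue='other')
print(result.tolist())
# ['North America', 'North America', 'North America', 'Europe', 'Europe', 'other']

test = dict(enumerate(('ab', 'cd', 'xyz')))   # {0: 'ab', 1: 'cd', 2: 'xyz'}
inverted = {x: k for k, v in test.items() for x in v}
print(inverted)
# {'a': 0, 'b': 0, 'c': 1, 'd': 1, 'x': 2, 'y': 2, 'z': 2}
```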

## 7. Compressing pandas objects <a href="#ya-suo-pandas-dui-xiang" id="ya-suo-pandas-dui-xiang"></a>

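If your pandas version is newer than 0.21.0, you can write pandas objects directly in compressed form. Common formats are gzip, bz2, and zip. Here we reuse the abalone dataset from earlier: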

```python
abalone.to_json('df.json.gz', orient='records',lines=True, compression='gzip')  
abalone.to_json('df.json', orient='records', lines=True)                        
```

```python
import os.path
os.path.getsize('df.json') / os.path.getsize('df.json.gz')  
```
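
Reading the compressed file back needs no extra step in most cases, because pd.read\_json infers the compression from the file extension by default. A minimal round-trip sketch on a small made-up frame, written to a temporary directory so that it runs without the abalone download:

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'sex': ['M', 'F'] * 500, 'rings': list(range(1000))})

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, 'df.json')
    packed = os.path.join(tmp, 'df.json.gz')

    frame.to_json(plain, orient='records', lines=True)
    frame.to_json(packed, orient='records', lines=True, compression='gzip')

    # How many times smaller the gzip file is
    ratio = os.path.getsize(plain) / os.path.getsize(packed)

    # compression is inferred from the '.gz' suffix
    restored = pd.read_json(packed, orient='records', lines=True)

print(ratio)
print(restored.head())
```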

## 8. Miscellaneous

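```python
from collections import namedtuple

# Build an ID by concatenating the arguments, skipping missing values
def create_id(*args):
    # str(np.nan) is 'nan' (lowercase), so that is the value to filter out
    return ''.join([i for i in map(str, args) if i != 'nan'])

# Assumes df is a DataFrame with columns labelled 1, 2 and 3
df['id'] = np.vectorize(create_id)(df[1], df[2], df[3])

# Concatenate all values to one column
df['x'] = df.astype(str).values.sum(axis=1)

# A clean way to initialize a DataFrame with a list of namedtuples
Point = namedtuple('Point', ['x', 'y'])
points = [Point(1, 2), Point(3, 4)]
pd.DataFrame(points, columns=Point._fields)
# Out[13]:
#    x  y
# 0  1  2
# 1  3  4
```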
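
An alternative for the row-wise concatenation is to join the string values of each row with agg. A small sketch on a made-up frame; the separator '-' is an arbitrary choice:

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [0.5, 1.5]})

# Convert every cell to text, then join the cells of each row
frame['joined'] = frame.astype(str).agg('-'.join, axis=1)
print(frame['joined'].tolist())
```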

## 9. Summary <a href="#yuan-ma-ji-github-di-zhi" id="yuan-ma-ji-github-di-zhi"></a>

* GitHub repository: [https://github.com/yaozeliang/pandas\_share](https://github.com/yaozeliang/pandas_share/tree/master/Pandas%E4%B9%8B%E6%97%85_04%20pandas%E8%B6%85%E5%AE%9E%E7%94%A8%E6%8A%80%E5%B7%A7)

