[pandas] Custom Sorting of a DataFrame

Author: 不正经 | 2024-03-18 09:13:15

Goal

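At work I often need to sort by a variable in a custom order, for example a length field with the values ["0-12","12-30","30-60","60-120","120-180","180-240","240-300","300+"]. The sort_values() function cannot do this directly, because it would order these strings lexicographically.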
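
To see why a plain sort is not enough, here is a small sketch using the length buckets above. The column name `length` and the one-column table are made up for illustration. Strings are compared character by character, so "120-180" lands before "30-60".

```python
import pandas as pd

# Hypothetical sample data: the column name "length" is an assumption.
buckets = ["0-12", "12-30", "30-60", "60-120", "120-180", "180-240", "240-300", "300+"]
df = pd.DataFrame({"length": buckets})

# A plain sort compares the strings character by character.
plain_sorted = df.sort_values("length")["length"].tolist()
print(plain_sorted)
# ['0-12', '12-30', '120-180', '180-240', '240-300', '30-60', '300+', '60-120']
```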

Solutions

Method 1 (recommended)

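Use pandas' CategoricalDtype to define a custom order over the otherwise unordered values.
Then cast the corresponding column of the DataFrame to this newly created CategoricalDtype with astype.
Note: orderList must cover the values in the target column. Any value that is not in orderList is turned into NaN by astype.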

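import pandas as pd
from pandas.api.types import CategoricalDtype
def genOrder(df,orderList,colName):
    '''
    Make a column sortable in a custom order.
    orderList should ideally cover every value in df[colName].
    Args:
        df: the target table to sort
        orderList: the order, e.g. ["0-12","12-30","30-60","60-120","120-180","180-240","240-300","300+"]
        colName: name of the column to order, e.g. 'explore_locale'
    Return:
        df: the same table (modified in place) with colName converted to an
            ordered categorical, so it can be sorted directly
    '''
    # 1. Create the new ordered type
    cat_order = CategoricalDtype(orderList,ordered=True)
    # 2. Cast the target column to that ordered type
    df[colName] = df[colName].astype(cat_order)
    return df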
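
The warning about values missing from orderList can be checked with a short sketch. The data is invented, and the set-difference check before the conversion is only one possible safeguard, not part of the function above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical data: "d" is deliberately missing from the order list.
df = pd.DataFrame({"a": ["a", "d", "b"]})
order_list = ["a", "b", "c"]

# Find values that the order list does not cover before converting.
missing = sorted(set(df["a"].dropna()) - set(order_list))
print(missing)  # ['d']

# After the cast, the uncovered value has silently become NaN.
converted = df["a"].astype(CategoricalDtype(order_list, ordered=True))
print(converted.isna().tolist())  # [False, True, False]
```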

Example

Create a test table:

test = pd.DataFrame({'a':['a','c','b'],'b':[1,'b',2]})
print(test)
print(test.info())

output:
   a  b
0  a  1
1  c  b
2  b  2
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       3 non-null      object
 1   b       3 non-null      object
dtypes: object(2)

Note that the Dtype of both columns is object here.
Next, convert the columns with the function defined above.

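# sort by column a after converting it; the converted table is not sorted until sort_values() is called
test = genOrder(test,['a','b','c'],'a').sort_values('a')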
test = genOrder(test,[1,2,'b'],'b')
print(test)
print(test.info())

In the result the columns have become category, so they can be sorted directly with sort_values().

   a  b
0  a  1
2  b  2
1  c  b
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   a       3 non-null      category
 1   b       3 non-null      category
dtypes: category(2)
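
As a further sketch, here is the same idea applied to the length buckets from the goal section. The column names `length` and `n` and the rows are assumptions made up for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

buckets = ["0-12", "12-30", "30-60", "60-120", "120-180", "180-240", "240-300", "300+"]

# Hypothetical rows in scrambled order.
df = pd.DataFrame({"length": ["300+", "30-60", "0-12", "120-180"], "n": [4, 2, 1, 3]})
df["length"] = df["length"].astype(CategoricalDtype(buckets, ordered=True))

# sort_values now follows the category order instead of string order.
result = df.sort_values("length")
print(result["length"].tolist())  # ['0-12', '30-60', '120-180', '300+']

# An ordered categorical also supports comparisons against one of its categories.
at_least_120 = (df["length"] >= "120-180").tolist()
print(at_least_120)  # [True, False, False, True]
```

Because the column is ordered, filtering by a threshold bucket works without any extra rank column.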

Method 2

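This is a rather clumsy method that I used for a long time before.
It builds a new DataFrame with two columns: one holds the values to sort, ["0-12","12-30","30-60","60-120","120-180","180-240","240-300","300+"], and the other holds a numeric rank, [1,2,3,4,5,6,7,8]. After merging this new DataFrame with the target table, you can sort by the numeric column.
The idea is intuitive, but I do not particularly recommend it, because it breaks as soon as the columns are a MultiIndex.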

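def genOrder(df,orderList,colName):
    '''
    Custom sorting function.
    Args:
        df: the target table to sort
        orderList: the order, e.g. ['br', 'spa', 'in', 'pak', 'egy', 'tur']
        colName: name of the column to order, e.g. 'explore_locale'
    Return:
        df: the original DataFrame with a new column named {}rank added, sorted by that column
    '''
    # Helper table that maps each value to its position in orderList
    orderDf = pd.DataFrame({
        '{}rank'.format(colName):list(range(len(orderList))),
        colName:orderList
    })
    # Merge (inner join by default) and sort by the rank column
    tmpdf = orderDf.merge(df,on=colName).sort_values('{}rank'.format(colName))
    return tmpdf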
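
A variant of the same rank idea that avoids the merge is to map each value to its position and pass that mapping as the `key` argument of sort_values() (available in pandas 1.1 and later). This is a swapped-in technique, not the function above; the helper name `sort_by_rank` and the sample data are hypothetical. Unlike the default inner merge, which drops rows whose value is not in the order list, this sketch keeps such rows and places them last.

```python
import pandas as pd

def sort_by_rank(df, order_list, col_name):
    """Hypothetical helper: sort df by the position of each value in order_list."""
    rank = {value: i for i, value in enumerate(order_list)}
    # Values missing from order_list map to NaN and are sorted to the end.
    return df.sort_values(by=col_name, key=lambda s: s.map(rank))

# Invented rows; "xx" is deliberately absent from the order list.
df = pd.DataFrame({"locale": ["tur", "br", "xx", "in"], "n": [1, 2, 3, 4]})
out = sort_by_rank(df, ["br", "spa", "in", "pak", "egy", "tur"], "locale")
print(out["locale"].tolist())  # ['br', 'in', 'tur', 'xx']
```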

Complete code

import pandas as pd
from pandas.api.types import CategoricalDtype
# def genOrder(df,orderList,colName):
#     '''
#     Custom sorting function.
#     Args:
#         df: the target table to sort
#         orderList: the order, e.g. ['br', 'spa', 'in', 'pak', 'egy', 'tur']
#         colName: name of the column to order, e.g. 'explore_locale'
#     Return:
#         df: the original DataFrame with a new column named {}rank added, sorted by that column
#     '''
#     orderDf = pd.DataFrame({
#         '{}rank'.format(colName):[i for i in range(len(orderList))],
#         colName:orderList
#     })
#     tmpdf = orderDf.merge(df,on=colName).sort_values('{}rank'.format(colName))
#     return tmpdf
def genOrder(df,orderList,colName):
    '''
    Make a column sortable in a custom order.
    orderList should ideally cover every value in df[colName].
    Args:
        df: the target table to sort
        orderList: the order, e.g. ['br', 'spa', 'in', 'pak', 'egy', 'tur']
        colName: name of the column to order, e.g. 'explore_locale'
    Return:
        df: the same table (modified in place) with colName converted to an
            ordered categorical, so it can be sorted directly
    '''
    cat_order = CategoricalDtype(orderList,ordered=True)
    df[colName] = df[colName].astype(cat_order)
    return df
if __name__ == '__main__':
    # Test genOrder
    test = pd.DataFrame({'a':['a','c','b'],'b':[1,'b',2]})
    print(test)
    test = genOrder(test,['a','b','c'],'a').sort_values('a')
    test = genOrder(test,[1,2,'b'],'b')
    print(test)
    print(test.info())

Reference
dataframe sorting: how to custom-sort a Pandas DataFrame
