A Deep Dive into the Pandas groupby Function

一束尘光


一束尘光 · Published 2022-10-04 17:59:07

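Preface

I have been learning Pandas recently. Data processing often calls for grouping records by certain fields and analyzing each group, which is what the groupby function is for. This article is a detailed record of how it works.

Pandas version: 1.4.3

The groupby function first splits a DataFrame or Series by the fields of interest, putting rows that share the same value into one group. A transformation can then be applied to each group, and finally the per-group results are combined and returned.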
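The split, apply, and combine steps described above can be spelled out by hand. This is only a sketch on a small hypothetical frame (the names toy, key, and value are made up for illustration and are not the article's data), meant to show that a single groupby call produces the same result as the three manual steps.

```python
import pandas as pd

# Hypothetical toy data, not the article's data set.
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'value': [1, 2, 3, 4]})

# Split: one sub-frame per distinct key.
pieces = {key: toy[toy['key'] == key] for key in toy['key'].unique()}
# Apply: reduce each piece independently.
sums = {key: piece['value'].sum() for key, piece in pieces.items()}
# Combine: collect the per-group results into one object.
manual = pd.Series(sums, name='value').sort_index()

# The same three steps in a single groupby call.
with_groupby = toy.groupby('key')['value'].sum()

print(manual.to_dict())
print(with_groupby.to_dict())
```

Both dictionaries hold the same per-key totals; groupby simply performs the bookkeeping internally.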

I. Basic Usage

First, initialize some data for the demonstrations:

import pandas as pd

df = pd.DataFrame({
            'name': ['香蕉', '菠菜', '糯米', '糙米', '丝瓜', '冬瓜', '柑橘', '苹果', '橄榄油'],
            'category': ['水果', '蔬菜', '米面', '米面', '蔬菜', '蔬菜', '水果', '水果', '粮油'],
            'price': [3.5, 6, 2.8, 9, 3, 2.5, 3.2, 8, 18],
            'count': [2, 1, 3, 6, 4, 8, 5, 3, 2]
        })

Group by category:

grouped = df.groupby('category')
print(type(grouped))
print(grouped)

Output:

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x127112df0>

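grouped is a DataFrameGroupBy. Printing it directly only shows a memory address, which is not very informative, so here is a helper function to display it (why it can be written this way is explained later):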

def view_group(the_pd_group):
    for name, group in the_pd_group:
        print(f'group name: {name}')
        print('-' * 30)
        print(group)
        print('=' * 30, '\n')
view_group(grouped)

Output:

group name: 水果
------------------------------
    name  category  price  count
0   香蕉       水果    3.5      2
6   柑橘       水果    3.2      5
7   苹果       水果    8.0      3
============================== 
group name: 米面
------------------------------
    name  category  price  count
2   糯米       米面    2.8      3
3   糙米       米面    9.0      6
============================== 
group name: 粮油
------------------------------
   name    category  price  count
8  橄榄油       粮油   18.0      2
============================== 
group name: 蔬菜
------------------------------
    name  category  price  count
1   菠菜       蔬菜    6.0      1
4   丝瓜       蔬菜    3.0      4
5   冬瓜       蔬菜    2.5      8
==============================
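Besides iterating, a DataFrameGroupBy can be inspected through a few attributes and methods. The sketch below rebuilds the article's data so that it runs on its own and uses ngroups, size(), groups, and get_group(); it is meant as a quick way to look inside the object, not as a replacement for view_group.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['香蕉', '菠菜', '糯米', '糙米', '丝瓜', '冬瓜', '柑橘', '苹果', '橄榄油'],
    'category': ['水果', '蔬菜', '米面', '米面', '蔬菜', '蔬菜', '水果', '水果', '粮油'],
    'price': [3.5, 6, 2.8, 9, 3, 2.5, 3.2, 8, 18],
    'count': [2, 1, 3, 6, 4, 8, 5, 3, 2],
})
grouped = df.groupby('category')

print(grouped.ngroups)               # number of groups
print(grouped.size())                # number of rows in each group
print(list(grouped.groups['水果']))   # index labels of the rows in one group
print(grouped.get_group('水果'))      # one group as a DataFrame
```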

II. Parameters and Source Code

Next, look at the method definitions in the source code.
DataFrame.groupby

def groupby(
        self,
        by=None,
        axis: Axis = 0,
        level: Level | None = None,
        as_index: bool = True,
        sort: bool = True,
        group_keys: bool = True,
        squeeze: bool | lib.NoDefault = no_default,
        observed: bool = False,
        dropna: bool = True,
    ) -> DataFrameGroupBy:
    pass

Series.groupby

def groupby(
        self,
        by=None,
        axis=0,
        level=None,
        as_index: bool = True,
        sort: bool = True,
        group_keys: bool = True,
        squeeze: bool | lib.NoDefault = no_default,
        observed: bool = False,
        dropna: bool = True,
    ) -> SeriesGroupBy:
    pass

Series.groupby works much like DataFrame.groupby, so this article uses DataFrame for all examples.

Parameters

by

Recall the call from the basic usage section:

grouped = df.groupby('category')

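Here 'category' is the first parameter, by, which specifies what to group by. According to the official documentation, by can be a mapping, a function, a label, or a list of labels. The call above uses a label, which means it could also be written in the ways below.

List of labels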

grouped = df.groupby(['category'])
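A list of labels becomes more interesting with more than one entry. The following sketch uses a small hypothetical frame (the organic column is invented for this illustration and does not exist in the article's data) to show that grouping by two labels produces one group per distinct combination, keyed by a tuple.

```python
import pandas as pd

# Hypothetical rows; the 'organic' column is invented for this illustration.
toy = pd.DataFrame({
    'category': ['水果', '水果', '蔬菜', '蔬菜'],
    'organic': [True, False, True, True],
    'price': [3.5, 8.0, 6.0, 3.0],
})

# One group per distinct (category, organic) pair.
totals = toy.groupby(['category', 'organic'])['price'].sum()
print(totals)
```

The result is indexed by a two-level MultiIndex, one level per label in the list.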

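mapping
This approach maps the DataFrame's index values to group names. Here 水果 (fruit) and 蔬菜 (vegetables) are merged into the larger group 蔬菜水果 (produce), and 米面 (rice and flour) and 粮油 (grain and oil) into the larger group 米面粮油 (staples).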

category_dict = {'水果': '蔬菜水果', '蔬菜': '蔬菜水果', '米面': '米面粮油', '粮油': '米面粮油'}
the_map = {}
# Key the mapping by index label, because groupby looks up each index label in it
for idx in df.index:
    the_map[idx] = category_dict[df.loc[idx, 'category']]
grouped = df.groupby(the_map)
view_group(grouped)

Output:

group name: 米面粮油
------------------------------
    name  category  price  count
2   糯米       米面    2.8      3
3   糙米       米面    9.0      6
8  橄榄油      粮油   18.0      2
============================== 

group name: 蔬菜水果
------------------------------
    name  category  price  count
0   香蕉       水果    3.5      2
1   菠菜       蔬菜    6.0      1
4   丝瓜       蔬菜    3.0      4
5   冬瓜       蔬菜    2.5      8
6   柑橘       水果    3.2      5
7   苹果       水果    8.0      3
==============================

function
With this approach the custom function also receives the DataFrame's index values as input. The output is the same as in the mapping example.

category_dict = {'水果': '蔬菜水果', '蔬菜': '蔬菜水果', '米面': '米面粮油', '粮油': '米面粮油'}

def to_big_category(the_idx):
    # groupby calls this function with each index label, so look the row up by label
    return category_dict[df.loc[the_idx, 'category']]
grouped = df.groupby(to_big_category)
view_group(grouped)
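Both the mapping and the function variants go through the index to reach the category column. A shorter route, sketched here as an alternative rather than as the article's method, is to build the group labels as a Series with Series.map and pass that Series as by; pandas aligns it with the frame by index. The name big_category is chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['香蕉', '菠菜', '糯米', '糙米', '丝瓜', '冬瓜', '柑橘', '苹果', '橄榄油'],
    'category': ['水果', '蔬菜', '米面', '米面', '蔬菜', '蔬菜', '水果', '水果', '粮油'],
    'price': [3.5, 6, 2.8, 9, 3, 2.5, 3.2, 8, 18],
})
category_dict = {'水果': '蔬菜水果', '蔬菜': '蔬菜水果', '米面': '米面粮油', '粮油': '米面粮油'}

# One label per row, carrying the same index as df.
big_category = df['category'].map(category_dict).rename('big_category')
grouped = df.groupby(big_category)
sizes = grouped.size()
print(sizes)
```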

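axis

axis specifies which axis the split is made along.
0 - equivalent to index; split by rows (the default)
1 - equivalent to columns; split by columns

Here is an example of splitting by columns: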

def group_columns(column_name: str):
    if column_name in ['name', 'category']:
        return 'Group 1'
    else:
        return 'Group 2'
# Equivalent spelling: grouped = df.head(3).groupby(group_columns, axis='columns')
grouped = df.head(3).groupby(group_columns, axis=1)
view_group(grouped)

Output:

group name: Group 1
------------------------------
    name  category
0   香蕉       水果
1   菠菜       蔬菜
2   糯米       米面
============================== 

group name: Group 2
------------------------------
   price  count
0    3.5      2
1    6.0      1
2    2.8      3
==============================

This is like cutting the table vertically: the left half becomes Group 1 and the right half Group 2.
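The article targets Pandas 1.4.3. In newer releases (2.1 and later, as far as I know) passing axis=1 to groupby is deprecated, and the suggested replacement is to transpose first. The sketch below shows that route under that assumption: group the transposed frame by its row labels, then transpose each piece back. Note that transposing a frame with mixed column types yields object dtype.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['香蕉', '菠菜', '糯米'],
    'category': ['水果', '蔬菜', '米面'],
    'price': [3.5, 6.0, 2.8],
    'count': [2, 1, 3],
})

def group_columns(column_name: str) -> str:
    return 'Group 1' if column_name in ['name', 'category'] else 'Group 2'

# After df.T the column names are row labels, so a plain row-wise groupby works.
column_groups = {name: part.T for name, part in df.T.groupby(group_columns)}

for name, part in column_groups.items():
    print(name)
    print(part)
```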

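level

When the axis is a MultiIndex (hierarchical), group by one or more specific levels. A level can be given as an int position, counting from 0 (0 is the first level, and so on), or as a level name.

Build another test data set with a MultiIndex: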

the_arrays = [['A', 'A', 'A', 'B', 'A', 'A', 'A', 'B', 'A', 'A'],
              ['蔬菜水果', '蔬菜水果', '米面粮油', '休闲食品', '米面粮油', '蔬菜水果', '蔬菜水果', '休闲食品', '蔬菜水果', '米面粮油'],
              ['水果', '蔬菜', '米面', '糖果', '米面', '蔬菜', '蔬菜', '饼干', '水果', '粮油']]
the_index = pd.MultiIndex.from_arrays(arrays=the_arrays, names=['one ', 'two', 'three'])
df_2 = pd.DataFrame(data=[3.5, 6, 2.8, 4, 9, 3, 2.5, 3.2, 8, 18], index=the_index, columns=['price'])
print(df_2)

Output:

                     price
one  two  three       
A    蔬菜水果 水果       3.5
             蔬菜       6.0
     米面粮油 米面       2.8
B    休闲食品 糖果       4.0
A    米面粮油 米面       9.0
     蔬菜水果 蔬菜       3.0
             蔬菜       2.5
B    休闲食品 饼干       3.2
A    蔬菜水果 水果       8.0
     米面粮油 粮油      18.0

1. Group by the third level

grouped = df_2.groupby(level=2)
view_group(grouped)

Output:

group name: 水果
------------------------------
                      price
one  two    three       
A    蔬菜水果 水果       3.5
             水果       8.0
============================== 

group name: 米面
------------------------------
                     price
one  two    three       
A    米面粮油 米面       2.8
             米面       9.0
============================== 

group name: 粮油
------------------------------
                      price
one  two    three       
A    米面粮油 粮油      18.0
============================== 

group name: 糖果
------------------------------
                      price
one  two    three       
B    休闲食品 糖果       4.0
============================== 

group name: 蔬菜
------------------------------
                     price
one  two    three       
A    蔬菜水果 蔬菜       6.0
             蔬菜       3.0
             蔬菜       2.5
============================== 

group name: 饼干
------------------------------
                      price
one  two    three       
B    休闲食品 饼干       3.2
==============================

Six groups in total.

2. Group by the first and second levels

grouped = df_2.groupby(level=[0, 1])
view_group(grouped)

Output:

group name: ('A', '米面粮油')
------------------------------
                      price
one  two    three       
A    米面粮油 米面       2.8
             米面       9.0
             粮油      18.0
============================== 

group name: ('A', '蔬菜水果')
------------------------------
                      price
one  two    three       
A    蔬菜水果 水果       3.5
             蔬菜       6.0
             蔬菜       3.0
             蔬菜       2.5
             水果       8.0
============================== 

group name: ('B', '休闲食品')
------------------------------
                      price
one  two    three       
B    休闲食品 糖果       4.0
             饼干       3.2
==============================

Three groups in total. Note that the group names are now tuples.
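A level can also be addressed by its name instead of its position. The following sketch uses a small hypothetical frame (three rows with simplified prices, not df_2 itself) to check that both spellings give the same result.

```python
import pandas as pd

# Small hypothetical frame with a three-level index.
index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B'], ['蔬菜水果', '米面粮油', '休闲食品'], ['水果', '米面', '糖果']],
    names=['one', 'two', 'three'],
)
small = pd.DataFrame({'price': [3.5, 2.5, 4.0]}, index=index)

by_position = small.groupby(level=0)['price'].sum()
by_name = small.groupby(level='one')['price'].sum()
print(by_name)
```

Level names tend to survive reordering of the index better than positions do, which is one reason to prefer them in longer scripts.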

as_index

bool, default True. For aggregated output, return the result with the group labels as the index.

grouped = df.groupby('category', as_index=True)
print(grouped.sum())

Output with as_index=True:

            price  count
category              
水果         14.7     10
米面         11.8      9
粮油         18.0      2
蔬菜         11.5     13

grouped = df.groupby('category', as_index=False)
print(grouped.sum())

Output with as_index=False, which resembles the output style of SQL GROUP BY:

    category  price  count
0       水果   14.7     10
1       米面   11.8      9
2       粮油   18.0      2
3       蔬菜   11.5     13
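One way to think about as_index=False is as the default result followed by reset_index(). The sketch below checks this on a hypothetical toy frame (the names toy, key, and value are made up for the example); it selects the value column explicitly so that the behavior does not depend on how a given Pandas version treats non-numeric columns in sum().

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({'key': ['b', 'a', 'b'], 'value': [1, 2, 3]})

# Group labels stay in an ordinary column.
flat = toy.groupby('key', as_index=False)['value'].sum()
# Group labels become the index, then are moved back into a column.
via_reset = toy.groupby('key')['value'].sum().reset_index()

print(flat)
print(via_reset)
```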

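sort

bool, default True. Whether to sort the group keys; turning sorting off can improve performance. Note: sorting the group keys does not affect the order of rows within each group.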
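A minimal sketch of the difference, using hypothetical toy data: with sort=False the groups appear in order of first occurrence, and in both cases the rows inside a group keep their original order.

```python
import pandas as pd

# Hypothetical toy data; 'b' appears before 'a'.
toy = pd.DataFrame({'key': ['b', 'a', 'b', 'a'], 'value': [1, 2, 3, 4]})

sorted_keys = list(toy.groupby('key', sort=True)['value'].sum().index)
first_seen = list(toy.groupby('key', sort=False)['value'].sum().index)
rows_in_b = toy.groupby('key', sort=True).get_group('b')['value'].tolist()

print(sorted_keys)   # keys in sorted order
print(first_seen)    # keys in order of first appearance
print(rows_in_b)     # row order inside the group is unchanged
```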

group_keys

bool, default True.
When True and apply is called, the group keys are added to the index to identify each piece.
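The effect is easiest to see when the applied function returns only part of each group. The sketch below uses hypothetical toy data and a made-up helper, above_two; the exact behavior of group_keys has shifted between Pandas versions for functions that return an object with the same index as their input, so this example deliberately returns a subset.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({'key': ['a', 'a', 'b'], 'value': [1, 5, 7]})

def above_two(values: pd.Series) -> pd.Series:
    """Keep only the entries greater than 2 (illustrative helper)."""
    return values[values > 2]

with_keys = toy.groupby('key', group_keys=True)['value'].apply(above_two)
without_keys = toy.groupby('key', group_keys=False)['value'].apply(above_two)

print(with_keys)      # index has two levels: group key, then original label
print(without_keys)   # index holds the original labels only
```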

squeeze

Deprecated since version 1.1.0, so it is not covered here.

observed

bool, default False.
Only applies when any of the groupers are Categoricals.
If True, only show observed values for categorical groupers; if False, show all values for categorical groupers.
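A small sketch on hypothetical data: the grade column below is a Categorical that declares a category 'L' which never occurs in the rows. observed is passed explicitly in both calls, since the default may differ in Pandas versions other than the 1.4.3 used in this article.

```python
import pandas as pd

# Hypothetical data; category 'L' is declared but never appears.
toy = pd.DataFrame({
    'grade': pd.Categorical(['S', 'M', 'S'], categories=['S', 'M', 'L']),
    'qty': [1, 2, 3],
})

all_levels = toy.groupby('grade', observed=False)['qty'].sum()
seen_only = toy.groupby('grade', observed=True)['qty'].sum()

print(all_levels)   # includes 'L' with a sum of 0
print(seen_only)    # only 'S' and 'M'
```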

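dropna

bool, default True; added in version 1.1.0.
If True and the group keys contain NA values, the NA values are dropped together with their rows (axis=0) or columns (axis=1).
If False, NA values are treated as group keys in their own right.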
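A minimal sketch with hypothetical toy data containing one missing key shows the two behaviors side by side.

```python
import pandas as pd

# Hypothetical toy data; the second row has a missing key.
toy = pd.DataFrame({'key': ['a', None, 'a'], 'value': [1, 2, 3]})

dropped = toy.groupby('key', dropna=True)['value'].sum()
kept = toy.groupby('key', dropna=False)['value'].sum()

print(dropped)   # the row with the missing key is left out
print(kept)      # the missing key forms its own group
```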

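Return value

DataFrame.groupby returns a DataFrameGroupBy, while Series.groupby returns a SeriesGroupBy.
Reading the source shows that both inherit from BaseGroupBy.
[Figure: inheritance diagram of the GroupBy classes]

BaseGroupBy declares a grouper attribute of type ops.BaseGrouper, but BaseGroupBy has no __init__ method, so we move on to the GroupBy class. GroupBy redeclares the parent's grouper attribute and, in its __init__ method, calls get_grouper from grouper.py. The following is pseudocode extracted from the source.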

groupby.py

class GroupBy(BaseGroupBy[NDFrameT]):
    grouper: ops.BaseGrouper

    def __init__(self, ...):
        # ...
        if grouper is None:
            from pandas.core.groupby.grouper import get_grouper
            grouper, exclusions, obj = get_grouper(...)

grouper.py

def get_grouper(...) -> tuple[ops.BaseGrouper, frozenset[Hashable], NDFrameT]:
    # ...
    # create the internals grouper
    grouper = ops.BaseGrouper(
        group_axis, groupings, sort=sort, mutated=mutated, dropna=dropna
    )
    return grouper, frozenset(exclusions), obj

class Grouping:
    """
    obj : DataFrame or Series
    """
    def __init__(
        self,
        index: Index,
        grouper=None,
        obj: NDFrame | None = None,
        level=None,
        sort: bool = True,
        observed: bool = False,
        in_axis: bool = False,
        dropna: bool = True,
    ):
        pass

ops.py

class BaseGrouper:
    """
    This is an internal Grouper class, which actually holds
    the generated groups
    
    ......
    """
    def __init__(self, axis: Index, groupings: Sequence[grouper.Grouping], ...):
        # ...
        self._groupings: list[grouper.Grouping] = list(groupings)
    
    @property
    def groupings(self) -> list[grouper.Grouping]:
        return self._groupings

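BaseGrouper holds the grouping information that groupby ultimately produces: a list whose elements are of type grouper.Grouping. There is one Grouping per grouping key (for example, one per label passed to by), and the obj held by a Grouping is the DataFrame or Series being grouped.

The first section used a helper function to display the object returned by groupby; here is why that works. An iterable object implements the __iter__() method, so start with that method on BaseGroupBy: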

class BaseGroupBy:
    grouper: ops.BaseGrouper

    @final
    def __iter__(self) -> Iterator[tuple[Hashable, NDFrameT]]:
        return self.grouper.get_iterator(self._selected_obj, axis=self.axis)

Next, step into the BaseGrouper class:

class BaseGrouper:
    def get_iterator(
        self, data: NDFrameT, axis: int = 0
    ) -> Iterator[tuple[Hashable, NDFrameT]]:
        splitter = self._get_splitter(data, axis=axis)
        keys = self.group_keys_seq
        for key, group in zip(keys, splitter):
            yield key, group.__finalize__(data, method="groupby")

Stepping into group.__finalize__() in debug mode confirms that the object returned is indeed a DataFrame.
[Figure: debugger view of BaseGroupBy.__iter__()]
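To make the iteration protocol concrete, here is a deliberately simplified stand-in written in plain Python. The function iterate_groups is hypothetical and is not how Pandas implements get_iterator (Pandas uses an internal splitter); it only mimics the observable behavior of yielding (key, sub-frame) pairs, which is what the for loop in view_group relies on.

```python
import pandas as pd

def iterate_groups(frame: pd.DataFrame, column: str):
    """Simplified stand-in for GroupBy.__iter__; not the Pandas implementation."""
    for key in sorted(frame[column].unique()):
        # Each yielded piece is an ordinary DataFrame holding one group's rows.
        yield key, frame[frame[column] == key]

# Hypothetical toy data.
toy = pd.DataFrame({'key': ['b', 'a', 'b'], 'value': [1, 2, 3]})

ours = [(key, group['value'].tolist()) for key, group in iterate_groups(toy, 'key')]
theirs = [(key, group['value'].tolist()) for key, group in toy.groupby('key')]

print(ours)
print(theirs)
```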

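III. The Four Main Functions

With the groundwork above, the functions applied after groupby are much easier to follow.

agg

Aggregation is the most common operation after groupby and is a staple of data analysis.
For example, the maximum of each category group can be obtained with any of three spellings, and grouped.aggregate and grouped.agg are fully equivalent, because the SelectionMixin class contains the definition agg = aggregate.
[Figure: the three equivalent spellings]
To aggregate several fields with a different function for each, however, only aggregate or agg will do, for example to get the maximum price and the minimum count of each category group.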
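The original screenshots are not available in this copy, so the spellings below are an assumption about what they showed: a direct method call, agg with a function name, and aggregate with a function name, followed by a dict passed to agg for per-column functions. The data is the article's, rebuilt here so that the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['香蕉', '菠菜', '糯米', '糙米', '丝瓜', '冬瓜', '柑橘', '苹果', '橄榄油'],
    'category': ['水果', '蔬菜', '米面', '米面', '蔬菜', '蔬菜', '水果', '水果', '粮油'],
    'price': [3.5, 6, 2.8, 9, 3, 2.5, 3.2, 8, 18],
    'count': [2, 1, 3, 6, 4, 8, 5, 3, 2],
})
grouped = df.groupby('category')

# Three spellings of the same single-function aggregation.
by_method = grouped['price'].max()
by_agg = grouped['price'].agg('max')
by_aggregate = grouped['price'].aggregate('max')

# A different function per column requires agg/aggregate with a dict.
mixed = grouped.agg({'price': 'max', 'count': 'min'})
print(mixed)
```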

NumPy aggregation functions can be used as well:

import numpy as np
grouped.agg({'price': np.max, 'count': np.min})

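Common aggregation functions:

Function    Description
max         maximum
mean        mean
median      median
min         minimum
sum         sum
std         standard deviation
var         variance
count       count of non-NA values

The closest NumPy counterpart of count is np.size. Note, however, that np.size counts every entry including NaN, whereas count skips NA values, so the two differ when data is missing.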
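The difference between counting non-NA values and counting all rows only shows up when data is missing. A minimal sketch with hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; group 'a' contains one missing value.
toy = pd.DataFrame({'key': ['a', 'a', 'b'], 'value': [1.0, np.nan, 3.0]})
grouped_values = toy.groupby('key')['value']

counts = grouped_values.count()   # non-NA values per group
sizes = grouped_values.size()     # all rows per group, NaN included

print(counts)
print(sizes)
```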

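transform

Suppose we now need a new column, price_mean, showing the average price of each group.

transform does exactly this: it returns an object that has the same index as the original and is filled with the transformed values (a Series here, because a single column is selected), which can then be assigned back to the original DataFrame as a new column.
Example code: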

grouped = df.groupby('category', sort=False)
df['price_mean'] = grouped['price'].transform('mean')
print(df)

Output:
[Figure: df with the added price_mean column]
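The key contrast is between agg, which returns one value per group, and transform, which returns one value per original row. The sketch below illustrates this on hypothetical toy data and then uses the per-row result to center each value on its group mean; the column name centered is made up for the example.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({'key': ['a', 'b', 'a'], 'value': [1.0, 10.0, 3.0]})
grouped_values = toy.groupby('key')['value']

per_group = grouped_values.agg('mean')        # one entry per group
per_row = grouped_values.transform('mean')    # one entry per original row

# Because per_row shares toy's index, it can be combined with toy directly.
toy['centered'] = toy['value'] - per_row
print(toy)
```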

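apply

Suppose we now need the row with the highest price in each group. apply can do this: it accepts any user-defined function, which makes complex data manipulation possible.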

from pandas import DataFrame
grouped = df.groupby('category', as_index=False, sort=False)

def get_max_one(the_df: DataFrame):
    sort_df = the_df.sort_values(by='price', ascending=True)
    return sort_df.iloc[-1, :]
max_price_df = grouped.apply(get_max_one)
max_price_df

Output:
[Figure: the highest-priced row of each category]
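For this particular task there is an alternative that avoids apply: ask each group for the index label of its highest price with idxmax and select those rows with loc. This is a sketch of a different technique from the one above, not a restatement of it, and it assumes the index labels are unique.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['香蕉', '菠菜', '糯米', '糙米', '丝瓜', '冬瓜', '柑橘', '苹果', '橄榄油'],
    'category': ['水果', '蔬菜', '米面', '米面', '蔬菜', '蔬菜', '水果', '水果', '粮油'],
    'price': [3.5, 6, 2.8, 9, 3, 2.5, 3.2, 8, 18],
    'count': [2, 1, 3, 6, 4, 8, 5, 3, 2],
})

# Index label of the highest-priced row in each group, in order of first appearance.
best_labels = df.groupby('category', sort=False)['price'].idxmax()
best_rows = df.loc[best_labels]
print(best_rows)
```

Unlike the apply version, the result keeps the original row labels, which makes it easy to join back to df.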

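filter

filter screens the grouped data further: the filter function is evaluated once per group, groups for which it returns False are dropped in their entirety, and the rows of the remaining groups are returned as a new DataFrame.

Suppose we want to drop the groups whose average price is not above 4. Judging by the data from the transform section, the 蔬菜 category is filtered out.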

grouped = df.groupby('category', as_index=False, sort=False)
filtered = grouped.filter(lambda sub_df: sub_df['price'].mean() > 4)
print(filtered)

Output:
[Figure: df without the rows of the 蔬菜 category]
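The same selection can be expressed as a boolean mask built with transform, which ties this section back to the previous one. The sketch below, on the article's data, compares the two approaches; it is offered as an equivalent formulation under the assumption that the filter condition depends only on a per-group statistic.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['香蕉', '菠菜', '糯米', '糙米', '丝瓜', '冬瓜', '柑橘', '苹果', '橄榄油'],
    'category': ['水果', '蔬菜', '米面', '米面', '蔬菜', '蔬菜', '水果', '水果', '粮油'],
    'price': [3.5, 6, 2.8, 9, 3, 2.5, 3.2, 8, 18],
    'count': [2, 1, 3, 6, 4, 8, 5, 3, 2],
})

# Whole groups are kept or dropped based on the group mean.
by_filter = df.groupby('category').filter(lambda sub_df: sub_df['price'].mean() > 4)

# Equivalent mask: broadcast each group's mean to its rows, then compare.
mask = df.groupby('category')['price'].transform('mean') > 4
by_mask = df[mask]

print(by_filter)
```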

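IV. Summary

groupby splits the original DataFrame or Series into several sub-DataFrames or sub-Series according to the grouping fields, one per group. The operations that follow groupby (agg, apply, and so on) therefore all operate on those sub-DataFrames or sub-Series. Once this is understood, the main principle behind groupby in Pandas is understood.

V. References

Pandas official documentation for pandas.DataFrame.groupby
Pandas official documentation for pandas.Series.groupby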
