
Python nlargest: more strange results using groupby and nlargest() in pandas

pandas groupby nlargest

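Let's use the same df and the workaround proposed in the selected answer. Basically, I'm trying to run two groupby operations and select the N largest rows of each group. However, as you can see below, one of the two operations raises an error.

Given that the original post uncovered a bug in the code (see here), I wonder whether this is another bug or another manifestation of the same one.

Unfortunately, I am stuck at work until these issues are fixed. Could we get some eyes on this? I can't offer a bounty until tomorrow.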

DF:

{'city1': {0: 'Chicago', 1: 'Chicago', 2: 'Chicago', 3: 'Chicago', 4: 'Miami', 5: 'Houston', 6: 'Austin'},
 'city2': {0: 'Toronto', 1: 'Detroit', 2: 'St.Louis', 3: 'Miami', 4: 'Dallas', 5: 'Dallas', 6: 'Dallas'},
 'p234_r_c': {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0},
 'plant1_type': {0: 'COMBCYCL', 1: 'COMBCYCL', 2: 'NUKE', 3: 'COAL', 4: 'NUKE', 5: 'COMBCYCL', 6: 'COAL'},
 'plant2_type': {0: 'COAL', 1: 'COAL', 2: 'COMBCYCL', 3: 'COMBCYCL', 4: 'COAL', 5: 'NUKE', 6: 'NUKE'}}

You can build the df from the dict above with pd.DataFrame(dct), where dct is that dict.

First groupby: seems to produce a sensible result.

cols = ['city2','plant1_type','plant2_type']

df.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1).reset_index()

      city2 plant1_type plant2_type  p234_r_c
0   Toronto    COMBCYCL        COAL       5.0
1   Detroit    COMBCYCL        COAL       4.0
2  St.Louis        NUKE    COMBCYCL       2.0
3     Miami        COAL    COMBCYCL       0.5
4    Dallas        NUKE        COAL       1.0
5    Dallas    COMBCYCL        NUKE       4.0
6    Dallas        COAL        NUKE       3.0

Second groupby: raises an error. The only difference is that it uses city1 instead of city2.

cols = ['city1','plant1_type','plant2_type']

df.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1).reset_index()

Error output:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in ()
----> 1 test1.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1).reset_index()

C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\series.py in reset_index(self, level, drop, name, inplace)
    967 else:
    968 df = self.to_frame(name)
--> 969 return df.reset_index(level=level, drop=drop)
    970
    971 def __unicode__(self):

C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
   2944 level_values = _maybe_casted_values(lev, lab)
   2945 if level is None or i in level:
-> 2946 new_obj.insert(0, col_name, level_values)
   2947
   2948 elif not drop:

C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\frame.py in insert(self, loc, column, value, allow_duplicates)
   2447 value = self._sanitize_column(column, value)
   2448 self._data.insert(loc, column, value,
-> 2449 allow_duplicates=allow_duplicates)
   2450
   2451 def assign(self, **kwargs):

C:\Users\user1\Anaconda3\lib\site-packages\pandas\core\internals.py in insert(self, loc, item, value, allow_duplicates)
   3508 if not allow_duplicates and item in self.items:
   3509 # Should this be a different kind of error??
-> 3510 raise ValueError('cannot insert %s, already exists' % item)
   3511
   3512 if not isinstance(loc, int):

ValueError: cannot insert plant2_type, already exists

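Finally:

How do I get the city1 column in the result of the groupby on ['city2', 'plant1_type', 'plant2_type'], and the city2 column in the result of the groupby on ['city1', 'plant1_type', 'plant2_type']?

That is, I want to know the corresponding city1 values for the groupby on ['city2', 'plant1_type', 'plant2_type'], and the corresponding city2 values for the groupby on ['city1', 'plant1_type', 'plant2_type'].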

Update:

Why do the following results have completely different structures? The only difference is that city2 is used in A and city1 in B.

A)

cols = ['city2','plant1_type','plant2_type']

test1.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1)

city2     plant1_type  plant2_type
Toronto   COMBCYCL     COAL           5.0
Detroit   COMBCYCL     COAL           4.0
St.Louis  NUKE         COMBCYCL       2.0
Miami     COAL         COMBCYCL       0.5
Dallas    NUKE         COAL           1.0
          COMBCYCL     NUKE           4.0
          COAL         NUKE           3.0
Name: p234_r_c, dtype: float64

B)

cols2 = ['city1','plant1_type','plant2_type']

test1.set_index(cols2).groupby(level=cols2)['p234_r_c'].nlargest(1)

city1    plant1_type  plant2_type  city1    plant1_type  plant2_type
Austin   COAL         NUKE         Austin   COAL         NUKE           3.0
Chicago  COAL         COMBCYCL     Chicago  COAL         COMBCYCL       0.5
         COMBCYCL     COAL         Chicago  COMBCYCL     COAL           5.0
         NUKE         COMBCYCL     Chicago  NUKE         COMBCYCL       2.0
Houston  COMBCYCL     NUKE         Houston  COMBCYCL     NUKE           4.0
Miami    NUKE         COAL         Miami    NUKE         COAL           1.0
Name: p234_r_c, dtype: float64
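
One thing that can be checked directly is how many rows fall into each group under the two key lists. The sketch below rebuilds the data column by column (equivalent to the dict above) and compares group sizes. With the city2 keys every group holds exactly one row, while with the city1 keys one group holds two. Whether this is what makes pandas return differently shaped indexes depends on the pandas version; treat that link as an assumption, not something the post confirms.

```python
import pandas as pd

# Same data as the dict in the question, written column by column.
df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})

cols = ["city2", "plant1_type", "plant2_type"]   # case A
cols2 = ["city1", "plant1_type", "plant2_type"]  # case B

# Number of rows in each group for the two key lists.
sizes_a = df.groupby(cols).size()
sizes_b = df.groupby(cols2).size()

print("A: groups =", len(sizes_a), "largest group =", sizes_a.max())
print("B: groups =", len(sizes_b), "largest group =", sizes_b.max())
```

In case A, nlargest(1) therefore returns every row unchanged, whereas in case B it has to discard one of the two Chicago / COMBCYCL / COAL rows.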

Solution:

Try this:

In [76]: df.groupby(cols2)['p234_r_c'].nlargest(1).reset_index(level=3, drop=True).reset_index()

Out[76]:

     city1 plant1_type plant2_type  p234_r_c
0   Austin        COAL        NUKE       3.0
1  Chicago        COAL    COMBCYCL       0.5
2  Chicago    COMBCYCL        COAL       5.0
3  Chicago        NUKE    COMBCYCL       2.0
4  Houston    COMBCYCL        NUKE       4.0
5    Miami        NUKE        COAL       1.0
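
The question also asks how to see the city2 value that belongs to each city1 group's maximum (and the reverse). The result above does not carry that column. One possible sketch, not taken from the original answer, is to look up the row label of each group's maximum with idxmax() and pull the full rows back with .loc. For N greater than 1, sorting and then taking head(N) per group is a common alternative. The helper name top_n_per_group is hypothetical.

```python
import pandas as pd

# Same data as the dict in the question, written column by column.
df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})

cols2 = ["city1", "plant1_type", "plant2_type"]

# idxmax() gives, per group, the original row label of the largest value,
# so .loc[...] returns the complete rows, including the city2 column.
top1 = df.loc[df.groupby(cols2)["p234_r_c"].idxmax()]


def top_n_per_group(frame, keys, value, n):
    """Hypothetical helper: the n rows with the largest `value` in each group."""
    ordered = frame.sort_values(value, ascending=False)
    # head(n) keeps the first n rows of every group in the sorted order
    # and leaves all original columns in place.
    return ordered.groupby(keys).head(n)


top2 = top_n_per_group(df, cols2, "p234_r_c", 2)

print(top1)
print(top2)
```

Swapping cols2 for ['city2', 'plant1_type', 'plant2_type'] gives the corresponding city1 values in the same way. Neither variant calls nlargest(), so the extra index levels never appear.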

Frankly, I don't understand the following behavior:

In [77]: df.set_index(cols2).groupby(level=cols2)['p234_r_c'].nlargest(1)

Out[77]:

city1    plant1_type  plant2_type  city1    plant1_type  plant2_type
Austin   COAL         NUKE         Austin   COAL         NUKE           3.0
Chicago  COAL         COMBCYCL     Chicago  COAL         COMBCYCL       0.5
         COMBCYCL     COAL         Chicago  COMBCYCL     COAL           5.0
         NUKE         COMBCYCL     Chicago  NUKE         COMBCYCL       2.0
Houston  COMBCYCL     NUKE         Houston  COMBCYCL     NUKE           4.0
Miami    NUKE         COAL         Miami    NUKE         COAL           1.0
Name: p234_r_c, dtype: float64

where:

In [78]: cols2

Out[78]: ['city1', 'plant1_type', 'plant2_type']
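
The index in Out[77] carries each of city1, plant1_type and plant2_type twice, which is consistent with the ValueError in the question: reset_index() turns every index level into a column and refuses to insert a column name that already exists. The sketch below reproduces that on a tiny hand-built Series (the two-row data is made up for illustration) and shows that dropping the repeated level by position first avoids the error.

```python
import pandas as pd

# A small Series whose MultiIndex repeats a level name, mimicking Out[77].
idx = pd.MultiIndex.from_tuples(
    [("Austin", "Austin"), ("Miami", "Miami")],
    names=["city1", "city1"],
)
s = pd.Series([3.0, 1.0], index=idx, name="p234_r_c")

# reset_index() tries to insert a "city1" column twice and fails.
try:
    s.reset_index()
    message = None
except ValueError as exc:
    message = str(exc)
print(message)

# Dropping the repeated level by position (not by name, which would be
# ambiguous) leaves unique level names, so reset_index() succeeds.
fixed = s.reset_index(level=0, drop=True).reset_index()
print(fixed)
```

This is the same idea as reset_index(level=3, drop=True) in the solution above: remove the surplus level before converting the rest of the index into columns.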

Source: https://www.icode9.com/content-1-493701.html
