pandas qcut
Binning is a useful strategy for spotting trends in numeric data. Sometimes we need an age range rather than an exact age, a profit margin rather than a profit figure, or a grade rather than a score. Binning addresses exactly these cases. The pandas library has two useful functions for binning data, cut and qcut, but they can be confusing. In this article, I will try to explain the use of both in detail.
Binning
To understand the concept of binning, we may refer to a histogram. I am going to use a student performance dataset for this tutorial. Please feel free to download the dataset from this link:
Import the necessary packages and the dataset now.
import pandas as pd
import numpy as np
import seaborn as sns
df = pd.read_csv('StudentsPerformance.csv')
Using the dataset above, make a histogram of the math score data:
df['math score'].plot(kind='hist')
We did not specify a number of bins here, but behind the scenes a binning operation took place. By default, the math scores were divided into 10 equal-width bins, such as 20–30 and 30–40. There are many scenarios where we need to define the bins explicitly and use them in the data analysis.
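To make the implicit binning visible, here is a minimal sketch that computes the same kind of 10 equal-width bins directly and then defines bins explicitly. It assumes a synthetic series of scores between 0 and 100 as a stand-in for df['math score'], and the hand-picked edges in explicit_edges are purely illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['math score']; these values are made up for illustration.
rng = np.random.default_rng(0)
scores = pd.Series(rng.integers(0, 101, size=1000), name="math score")

# Ten equal-width bins, the same default a histogram plot falls back on.
counts, edges = np.histogram(scores, bins=10)

# Defining the bins explicitly instead: edges chosen by hand (an assumption).
explicit_edges = [0, 20, 40, 60, 80, 100]
binned = pd.cut(scores, bins=explicit_edges, include_lowest=True)

# Number of scores per explicit bin, kept in bin order rather than sorted by count.
per_bin = binned.value_counts(sort=False)

print(edges)
print(per_bin)
```

Ten bins always produce eleven edges, and every score lands in exactly one bin, so the counts add up to the number of rows in both cases.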
qcut
This function tries to divide the data into equal-sized bins, meaning bins that each hold roughly the same number of observations. The bin edges are computed from percentiles of the data's distribution rather than from fixed numeric edges. So, you may expect the exact equal-sized bins in simple data