- # 44. pandas.crosstab function
- pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
- Compute a simple cross tabulation of two (or more) factors.
- By default, computes a frequency table of the factors unless an array of values and an aggregation function are passed.
- Parameters:
- index
- array-like, Series, or list of arrays/Series
- Values to group by in the rows.
- columns
- array-like, Series, or list of arrays/Series
- Values to group by in the columns.
- values
- array-like, optional
- Array of values to aggregate according to the factors. Requires aggfunc be specified.
- rownames
- sequence, default None
- If passed, must match number of row arrays passed.
- colnames
- sequence, default None
- If passed, must match number of column arrays passed.
- aggfunc
- function, optional
- If specified, requires values be specified as well.
- margins
- bool, default False
- Add row/column margins (subtotals).
- margins_name
- str, default ‘All’
- Name of the row/column that will contain the totals when margins is True.
- dropna
- bool, default True
- Do not include columns whose entries are all NaN.
- normalize
- bool, {‘all’, ‘index’, ‘columns’}, or {0,1}, default False
- Normalize by dividing all values by the sum of values.
- If passed ‘all’ or True, will normalize over all values.
- If passed ‘index’ will normalize over each row.
- If passed ‘columns’ will normalize over each column.
- If margins is True, will also normalize margin values.
- Returns:
- DataFrame
- Cross tabulation of the data.
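44-5-4. If normalize is True or 'all', every cell is divided by the grand total, so the whole table sums to 1; if normalize is 'index' or 'columns', each row or each column, respectively, is normalized so that it sums to 1.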
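The three normalization modes can be checked on a small frequency table. This is a minimal sketch: the six-row frame below is made up for illustration and reuses only the City and Category values from the example that follows.

```python
import pandas as pd

# Illustrative data (an assumption, not taken from a real dataset)
df = pd.DataFrame({
    "City": ["New York", "Los Angeles", "New York",
             "Los Angeles", "New York", "Los Angeles"],
    "Category": ["A", "A", "B", "B", "A", "B"],
})

# Plain counts, for reference
counts = pd.crosstab(df["City"], df["Category"])

# normalize='all' (or True): each cell divided by the grand total
by_all = pd.crosstab(df["City"], df["Category"], normalize="all")

# normalize='index': each row sums to 1
by_row = pd.crosstab(df["City"], df["Category"], normalize="index")

# normalize='columns': each column sums to 1
by_col = pd.crosstab(df["City"], df["Category"], normalize="columns")

print(counts)
print(by_all)
print(by_row)
print(by_col)
```

Summing `by_all` over every cell, `by_row` along each row, and `by_col` down each column should give 1 in each case.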
- # 44. pandas.crosstab function
- import pandas as pd
- import numpy as np
- # Create a sample dataset
- data = {
-     'Date': pd.date_range('2023-01-01', periods=6, freq='D'),
-     'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York', 'Los Angeles'],
-     'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
-     'Values': [100, 200, 150, 250, np.nan, 300]
- }
- df = pd.DataFrame(data)
- print("Original dataset:")
- print(df)
- # Build a cross tabulation with crosstab
- crosstab_result = pd.crosstab(
-     index=[df['Date'], df['City']],
-     columns=df['Category'],
-     values=df['Values'],
-     rownames=['Date', 'City'],
-     colnames=['Category'],
-     aggfunc='sum',
-     margins=True,
-     margins_name='All',
-     dropna=True,
-     normalize=False
- )
- print("\ncrosstab result:")
- print(crosstab_result)
- # 44. pandas.crosstab function
- # Original dataset:
- # Date City Category Values
- # 0 2023-01-01 New York A 100.0
- # 1 2023-01-02 Los Angeles A 200.0
- # 2 2023-01-03 New York B 150.0
- # 3 2023-01-04 Los Angeles B 250.0
- # 4 2023-01-05 New York A NaN
- # 5 2023-01-06 Los Angeles B 300.0
- # crosstab result:
- # Category A B All
- # Date City
- # 2023-01-01 00:00:00 New York 100.0 NaN 100.0
- # 2023-01-02 00:00:00 Los Angeles 200.0 NaN 200.0
- # 2023-01-03 00:00:00 New York NaN 150.0 150.0
- # 2023-01-04 00:00:00 Los Angeles NaN 250.0 250.0
- # 2023-01-05 00:00:00 New York 0.0 NaN NaN
- # 2023-01-06 00:00:00 Los Angeles NaN 300.0 300.0
- # All 300.0 700.0 1000.0
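The reference text distinguishes two modes: a plain frequency table by default, and an aggregation when both values and aggfunc are passed. A smaller sketch makes the difference easier to see; the four-row frame is an assumption made up for illustration and contains no missing values.

```python
import pandas as pd

# Illustrative data (an assumption): one row per (City, Category) pair
df = pd.DataFrame({
    "City": ["New York", "Los Angeles", "New York", "Los Angeles"],
    "Category": ["A", "A", "B", "B"],
    "Values": [100, 200, 150, 250],
})

# Default mode: count how often each (City, Category) pair occurs
freq = pd.crosstab(df["City"], df["Category"])

# Aggregation mode: sum Values within each (City, Category) pair;
# values and aggfunc have to be passed together
totals = pd.crosstab(
    df["City"], df["Category"], values=df["Values"], aggfunc="sum"
)

print(freq)
print(totals)
```

Here every pair occurs once, so `freq` holds only ones, while `totals` holds the corresponding Values entries.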
- # 45. pandas.cut function
- pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
- Bin values into discrete intervals.
- Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.
- Parameters:
- x
- array-like
- The input array to be binned. Must be 1-dimensional.
- bins
- int, sequence of scalars, or IntervalIndex
- The criteria to bin by.
- int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.
- sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.
- IntervalIndex : Defines the exact bins to be used. Note that IntervalIndex for bins must be non-overlapping.
- right
- bool, default True
- Indicates whether bins includes the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1,2], (2,3], (3,4]. This argument is ignored when bins is an IntervalIndex.
- labels
- array or False, default None
- Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins. This affects the type of the output container (see below). This argument is ignored when bins is an IntervalIndex. If True, raises an error. When ordered=False, labels must be provided.
- retbins
- bool, default False
- Whether to return the bins or not. Useful when bins is provided as a scalar.
- precision
- int, default 3
- The precision at which to store and display the bins labels.
- include_lowest
- bool, default False
- Whether the first interval should be left-inclusive or not.
- duplicates
- {default ‘raise’, ‘drop’}, optional
- If bin edges are not unique, raise ValueError or drop non-uniques.
- ordered
- bool, default True
- Whether the labels are ordered or not. Applies to returned types Categorical and Series (with Categorical dtype). If True, the resulting categorical will be ordered. If False, the resulting categorical will be unordered (labels must be provided).
- Returns:
- out
- Categorical, Series, or ndarray
- An array-like object representing the respective bin for each value of x. The type depends on the value of labels.
- None (default) : returns a Series for Series x or a Categorical for all other inputs. The values stored within are Interval dtype.
- sequence of scalars : returns a Series for Series x or a Categorical for all other inputs. The values stored within are whatever the type in the sequence is.
- False : returns an ndarray of integers.
- bins
- numpy.ndarray or IntervalIndex.
- The computed or specified bins. Only returned when retbins=True. For scalar or sequence bins, this is an ndarray with the computed bins. If set duplicates=drop, bins will drop non-unique bin. For an IntervalIndex bins, this is equal to bins.
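45-2-4. labels (optional, default None): an array or sequence used to label the output categories. If given, its length must equal the number of resulting intervals; if omitted, the intervals themselves are used as labels, e.g. (0, 1], (1, 2], ....
45-2-8. duplicates (optional, default 'raise'): {'raise', 'drop'}. If the bin edges are not unique, 'raise' raises a ValueError and 'drop' drops the non-unique edges.
45-4-1. When retbins=True is not set, pandas.cut returns only the binned result, which records the bin that each element of the input data x falls into: a Series of categorical dtype if x is a Series, otherwise a Categorical (or an array of integers when labels=False).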
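The notes on labels, duplicates and retbins can be tried out on a three-element Series. This is a minimal sketch under assumed inputs; the ages and bin edges are made up for illustration.

```python
import pandas as pd

ages = pd.Series([22, 35, 50])   # illustrative values (an assumption)
bins = [20, 30, 40, 50]

# labels=None (default): each value is labelled with its Interval
default_labels = pd.cut(ages, bins=bins)

# labels=False: only the integer position of the bin is returned
codes = pd.cut(ages, bins=bins, labels=False)

# retbins=True: also return the bin edges, which is mainly useful
# when bins is an integer and the edges are computed by pandas
binned, edges = pd.cut(ages, bins=3, retbins=True)

# duplicates='drop': the repeated edge 30 is removed instead of
# raising a ValueError
dropped = pd.cut(ages, bins=[20, 30, 30, 40, 50], duplicates="drop")

print(default_labels)
print(codes)
print(edges)
print(dropped.cat.categories)
```

With three equal-width bins, `edges` holds four values, and after dropping the duplicate edge `dropped` has three categories.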
- # 45. pandas.cut function
- import pandas as pd
- # Create a sample dataset
- data = {
-     'Age': [22, 25, 45, 33, 50, 41, 23, 37, 29, 31, 35, 48, 52, 44, 27]
- }
- df = pd.DataFrame(data)
- print("Original dataset:")
- print(df)
- # Define the bin edges
- bins = [20, 30, 40, 50, 60]
- # Use cut to split the ages into intervals
- df['Age Group'] = pd.cut(
-     x=df['Age'],
-     bins=bins,
-     right=True,
-     labels=['20-30', '30-40', '40-50', '50-60'],
-     retbins=False,
-     precision=0,
-     include_lowest=True,
-     duplicates='raise',
-     ordered=True
- )
- print("\nBinned dataset:")
- print(df)
- # 45. pandas.cut function
- # Original dataset:
- # Age
- # 0 22
- # 1 25
- # 2 45
- # 3 33
- # 4 50
- # 5 41
- # 6 23
- # 7 37
- # 8 29
- # 9 31
- # 10 35
- # 11 48
- # 12 52
- # 13 44
- # 14 27
- # Binned dataset:
- # Age Age Group
- # 0 22 20-30
- # 1 25 20-30
- # 2 45 40-50
- # 3 33 30-40
- # 4 50 40-50
- # 5 41 40-50
- # 6 23 20-30
- # 7 37 30-40
- # 8 29 20-30
- # 9 31 30-40
- # 10 35 30-40
- # 11 48 40-50
- # 12 52 50-60
- # 13 44 40-50
- # 14 27 20-30
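In the output above, age 50 lands in '40-50' because right=True makes each interval closed on the right. A short sketch of how right and include_lowest treat boundary values, using three made-up ages that sit exactly on bin edges (an assumption for illustration):

```python
import pandas as pd

ages = pd.Series([20, 30, 50])   # values chosen to sit on bin edges
bins = [20, 30, 40, 50, 60]
labels = ['20-30', '30-40', '40-50', '50-60']

# right=True: intervals are (20, 30], (30, 40], ...;
# include_lowest=True additionally closes the first interval on the left
right_closed = pd.cut(ages, bins=bins, labels=labels,
                      right=True, include_lowest=True)

# right=False: intervals are [20, 30), [30, 40), ...
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

# right=True without include_lowest: 20 falls outside (20, 30]
# and becomes NaN
no_lowest = pd.cut(ages, bins=bins, labels=labels, right=True)

print(right_closed.tolist())
print(left_closed.tolist())
print(no_lowest.isna().tolist())
```

The same edge value therefore moves to the next group up when right=False, which is worth checking whenever the labels are written as ranges like '20-30'.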
- # 46. pandas.qcut function
- pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
- Quantile-based discretization function.
- Discretize variable into equal-sized buckets based on rank or based on sample quantiles. For example 1000 values for 10 quantiles would produce a Categorical object indicating quantile membership for each data point.
- Parameters:
- x
- 1d ndarray or Series
- q
- int or list-like of float
- Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles.
- labels
- array or False, default None
- Used as labels for the resulting bins. Must be of the same length as the resulting bins. If False, return only integer indicators of the bins. If True, raises an error.
- retbins
- bool, optional
- Whether to return the (bins, labels) or not. Can be useful if bins is given as a scalar.
- precision
- int, optional
- The precision at which to store and display the bins labels.
- duplicates
- {default ‘raise’, ‘drop’}, optional
- If bin edges are not unique, raise ValueError or drop non-uniques.
- Returns:
- out
- Categorical or Series or array of integers if labels is False
- The return type (Categorical or Series) depends on the input: a Series of type category if input is a Series else Categorical. Bins are represented as categories when categorical data is returned.
- bins
- ndarray of floats
- Returned only if retbins is True.
- Notes
- Out of bounds values will be NA in the resulting Categorical object
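46-2-2. q (required): int or array-like of quantiles. If it is an integer, it gives the number of bins (buckets) to split the data into; if it is an array, it must contain floats between 0 and 1 that give the quantiles, e.g. [0, 0.25, 0.5, 0.75, 1.].
46-4-1-1. bins: a categorical array (Categorical dtype; a Series of category dtype when x is a Series) with the same shape as x, indicating the bin (bucket) each element belongs to.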
46-4-2-1. bins: with x
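The two forms of q, and the extra return value produced by retbins=True, can be sketched as follows. The ages are the same ones used in the example below; the variable names are hypothetical.

```python
import pandas as pd

ages = pd.Series([22, 25, 45, 33, 50, 41, 23, 37, 29, 31, 35, 48, 52, 44, 27])

# q as an integer: four quantile-based bins; retbins=True also
# returns the computed bin edges as an ndarray
by_count, edges = pd.qcut(ages, q=4, retbins=True)

# q as a list of quantiles between 0 and 1: the same quartile split,
# here with labels=False to get integer bin indicators
by_list = pd.qcut(ages, q=[0, 0.25, 0.5, 0.75, 1.0], labels=False)

print(by_count.value_counts().sort_index())
print(edges)
print(by_list.tolist())
```

Both calls describe the same quartiles, so the integer indicators in `by_list` should match the category codes of `by_count`.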
- # 46. pandas.qcut function
- import pandas as pd
- # Create a sample dataset
- data = {
-     'Age': [22, 25, 45, 33, 50, 41, 23, 37, 29, 31, 35, 48, 52, 44, 27]
- }
- df = pd.DataFrame(data)
- print("Original dataset:")
- print(df)
- # Use qcut to split the ages into four quantile-based bins
- df['Age Group'] = pd.qcut(
-     x=df['Age'],
-     q=4,
-     labels=['Q1', 'Q2', 'Q3', 'Q4'],
-     retbins=False,
-     precision=3,
-     duplicates='raise'
- )
- print("\nDataset after quantile binning:")
- print(df)
- # 46. pandas.qcut function
- # Original dataset:
- # Age
- # 0 22
- # 1 25
- # 2 45
- # 3 33
- # 4 50
- # 5 41
- # 6 23
- # 7 37
- # 8 29
- # 9 31
- # 10 35
- # 11 48
- # 12 52
- # 13 44
- # 14 27
- # Dataset after quantile binning:
- # Age Age Group
- # 0 22 Q1
- # 1 25 Q1
- # 2 45 Q4
- # 3 33 Q2
- # 4 50 Q4
- # 5 41 Q3
- # 6 23 Q1
- # 7 37 Q3
- # 8 29 Q2
- # 9 31 Q2
- # 10 35 Q2
- # 11 48 Q4
- # 12 52 Q4
- # 13 44 Q3
- # 14 27 Q1
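Since cut and qcut were applied to the same ages, it may help to compare how many values each one puts in a bin. This is a minimal sketch that uses four bins in both cases; it is only meant to contrast equal-width bins with quantile-based bins.

```python
import pandas as pd

ages = pd.Series([22, 25, 45, 33, 50, 41, 23, 37, 29, 31, 35, 48, 52, 44, 27])

# cut with an integer: four bins of equal width over the range of ages,
# so the number of values per bin can vary
equal_width = pd.cut(ages, bins=4).value_counts().sort_index()

# qcut with an integer: four bins with edges at the sample quartiles,
# so the number of values per bin is as even as the data allows
equal_freq = pd.qcut(ages, q=4).value_counts().sort_index()

print(equal_width)
print(equal_freq)
```

With 15 values, qcut cannot make four bins of exactly the same size, but the counts stay closer together than with the equal-width split.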