赞
踩
- I. 数据库风格的合并——merge
- i) 最简单的合并
- pd.merge(df1, df2, on='key') key为重叠列名
-
- ii) 连接键列名不同
- pd.merge(left, right, left_on='lkey', right_on='rkey')
-
- iii) 连接方式(默认为inner)
- pd.merge(left, right, on='key', how='outer')
-
- iv) 连接键为多列
- pd.merge(left, right, on=['key1','key2'])
-
- v) 重复列名的处理
- pd.merge(left, right, on='key', suffixes=['_left','_right'])
-
- vi) 索引上的合并(索引作为连接键)
- pd.merge(left, right, left_on='key', right_index=True)
-
-
- II. 按索引合并——join
- i)join实例方法实现按索引合并
- left.join(right, how='outer')
-
- ii)【参数DataFrame的索引】跟【调用者DataFrame的某个列】之间的连接
- left.join(right, on='key')
-
- iii)join方法合并多个DataFrame
- df1.join([df2,df3], how='outer', sort=True)
-
-
- III. 轴向连接——concat方法
- i) Series连接(axis=0)
- pd.concat([s1,s2,s3])
-
- ii) Series连接(axis=1)
- pd.concat([s1,s2,s3], axis=1, sort=True)
-
- iii) 连接方式(默认join='outer')
- pd.concat([s1,s2], axis=1, how='inner', sort=True)
-
- iv) 指定在非连接轴上使用的索引
- pd.concat([s1,s2], axis=1, sort=True, join_axes=['a','c','b','e'])
-
- v) 区分连接片段
- pd.concat([df1,df2], axis=1, sort=True, keys=['left','right'],
- names=['level0', 'level1'])
-
- vi) 抛弃无关行索引
- pd.concat([df1,df2], sort=True, ignore_index=True)
-
-
- IV. 合并重叠数据——combine_first()
- df1.combine_first(df2)
-
-
- V. df末尾追加数据——append
- Merge DataFrame objects by performing a database-style join operation by
- columns or indexes.
-
-
- merge(left, right, how='inner', on=None, left_on=None, right_on=None,
- left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'),
- copy=True, indicator=False, validate=None)
-
-
- left : DataFrame
-
- right : DataFrame
-
- how : {'left', 'right', 'outer', 'inner'}, default 'inner'
- * left: use only keys from left frame, similar to a SQL left outer join;
- preserve key order
- * right: use only keys from right frame, similar to a SQL right outer join;
- preserve key order
- * outer: use union of keys from both frames, similar to a SQL full outer
- join; sort keys lexicographically
- * inner: use intersection of keys from both frames, similar to a SQL inner
- join; preserve the order of the left keys
-
- on : label or list
- Column or index level names to join on. These must be found in both
- DataFrames. If `on` is None and not merging on indexes then this defaults
- to the intersection of the columns in both DataFrames.
-
- left_on : label or list, or array-like
- Column or index level names to join on in the left DataFrame. Can also
- be an array or list of arrays of the length of the left DataFrame.
- These arrays are treated as if they are columns.
-
- right_on : label or list, or array-like
- Column or index level names to join on in the right DataFrame. Can also
- be an array or list of arrays of the length of the right DataFrame.
- These arrays are treated as if they are columns.
-
- left_index : boolean, default False
- Use the index from the left DataFrame as the join key(s). If it is a
- MultiIndex, the number of keys in the other DataFrame (either the index
- or a number of columns) must match the number of levels
-
- right_index : boolean, default False
- Use the index from the right DataFrame as the join key. Same caveats as
- left_index
-
- sort : boolean, default False
- Sort the join keys lexicographically in the result DataFrame. If False,
- the order of the join keys depends on the join type (how keyword)
-
- suffixes : 2-length sequence (tuple, list, ...)
- Suffix to apply to overlapping column names in the left and right
- side, respectively
-
- copy : boolean, default True
- If False, do not copy data unnecessarily
-
-
- 重点参数
- left
- right
- how
- on
- left_on
- right_on
- left_index
- right_index
- 即 数据、连接方式、连接键
数据集的合并(merge)或连接(join)运算是通过一个或多个键将行连接起来的。这些运算是关系型数据库的核心
i) 最简单的合并
最简单的连接(如果没有显式指定连接键,merge默认将重叠的列名当作键)
最好还是指定连接键:
ii) 连接键列名不同
如果在左右DataFrame中作为连接键的列有不同的列名,或者说左侧DataFrame中用作连接键的列与右侧DataFrame中用作连接键的列不同,可以用left_on和(或)right_on关键字(分别)显示指定
iii) 连接方式(默认为inner)
上面结果中没有c、d及与之相关的数据,因为merge默认做的是"inner"连接,结果中的键是交集。如果需要其他连接方式,用how关键字显式指定。
how='outer'
how='left'
how默认为'inner'
iv) 连接键为多列
如果要根据多个键进行合并,传入一个由列名组成的列表即可:
v) 重复列名的处理
向suffixes关键字传入一个【附加到重叠列名上的字符串】的元组或列表,可以指定重叠列名的后缀
如果不指定,后缀名默认为"_x"和"_y"
如果向suffixes传入【重叠列名后缀名字符串的列表】显式指定:
vi) 索引上的合并(索引作为连接键)
有时候,DataFrame中的连接键位于其索引上。在这种情况下,可以传入"left_index=True"或right_index=True"(或两个都传)以说明索引应该被用作连接键
如果把层次化索引作为连接键,对应需要用列表的形式指明另一个DataFrame中用作合并键的多个列
- Join columns with other DataFrame either on index or on a key
- column. Efficiently Join multiple DataFrame objects by index at once by
- passing a list.
-
-
- join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
-
-
- other : DataFrame, Series with name field set, or list of DataFrame
- Index should be similar to one of the columns in this one. If a
- Series is passed, its name attribute must be set, and that will be
- used as the column name in the resulting joined DataFrame
-
- on : name, tuple/list of names, or array-like
- Column or index level name(s) in the caller to join on the index
- in `other`, otherwise joins index-on-index. If multiple
- values given, the `other` DataFrame must have a MultiIndex. Can
- pass an array as the join key if it is not already contained in
- the calling DataFrame. Like an Excel VLOOKUP operation
-
- how : {'left', 'right', 'outer', 'inner'}, default: 'left'
- How to handle the operation of the two objects.
-
- * left: use calling frame's index (or column if on is specified)
- * right: use other frame's index
- * outer: form union of calling frame's index (or column if on is
- specified) with other frame's index, and sort it
- lexicographically
- * inner: form intersection of calling frame's index (or column if
- on is specified) with other frame's index, preserving the order
- of the calling's one
- lsuffix : string
- Suffix to use from left frame's overlapping columns
-
- rsuffix : string
- Suffix to use from right frame's overlapping columns
- sort : boolean, default False
- Order result DataFrame lexicographically by the join key. If False,
- the order of the join key depends on the join type (how keyword)
- 重点参数:
- other
- on
- how
- 即数据、连接方式、连接键(默认将索引作为连接键)
i)join实例方法实现按索引合并
DataFrame还有一个join实例方法,它能更为方便地实现按索引合并,可以用于合并多个带有相同或相似索引的DataFrame对象(join默认在连接键上做左连接,使用时最好用how关键字显式指定连接方式)
ii)【参数DataFrame的索引】跟【调用者DataFrame的某个列】之间的连接
join实例方法还支持【参数DataFrame的索引】跟【调用者DataFrame的某个列】之间的连接:
iii)join方法合并多个DataFrame
对于简单的索引合并,还可以向join传入一个DataFrame对象的列表,合并多个df
- Concatenate pandas objects along a particular axis with optional set logic
- along the other axes.
-
- Can also add a layer of hierarchical indexing on the concatenation axis,
- which may be useful if the labels are the same (or overlapping) on
- the passed axis number.
-
-
-
- concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
- keys=None, levels=None, names=None, verify_integrity=False,
- sort=None, copy=True)
-
-
- objs : a sequence or mapping of Series, DataFrame, or Panel objects
- If a dict is passed, the sorted keys will be used as the `keys`
- argument, unless it is passed, in which case the values will be
- selected (see below). Any None objects will be dropped silently unless
- they are all None in which case a ValueError will be raised
-
- axis : {0/'index', 1/'columns'}, default 0
- The axis to concatenate along
-
- join : {'inner', 'outer'}, default 'outer'
- How to handle indexes on other axis(es)
-
- join_axes : list of Index objects
- Specific indexes to use for the other n - 1 axes instead of performing
- inner/outer set logic
-
- ignore_index : boolean, default False
- If True, do not use the index values along the concatenation axis. The
- resulting axis will be labeled 0, ..., n - 1. This is useful if you are
- concatenating objects where the concatenation axis does not have
- meaningful indexing information. Note the index values on the other
- axes are still respected in the join.
-
- keys : sequence, default None
- If multiple levels passed, should contain tuples. Construct
- hierarchical index using the passed keys as the outermost level
-
- levels : list of sequences, default None
- Specific levels (unique values) to use for constructing a
- MultiIndex. Otherwise they will be inferred from the keys
-
- names : list, default None
- Names for the levels in the resulting hierarchical index
-
- verify_integrity : boolean, default False
- Check whether the new concatenated axis contains duplicates. This can
- be very expensive relative to the actual data concatenation
-
- sort : boolean, default None
- Sort non-concatenation axis if it is not already aligned when `join`
- is 'outer'. The current default of sorting is deprecated and will
- change to not-sorting in a future version of pandas.
-
- Explicitly pass ``sort=True`` to silence the warning and sort.
- Explicitly pass ``sort=False`` to silence the warning and not sort.
-
- This has no effect when ``join='inner'``, which already preserves
- the order of the non-concatenation axis.
-
- .. versionadded:: 0.23.0
-
- copy : boolean, default True
- If False, do not copy data unnecessarily
-
-
- 重点参数:
- objs
- axis 连接轴
- join 连接方式
- sort 非连接轴上标签的是否列序
concat的参数join默认为"outer",默认做的是外连接
i) Series连接(axis=0)
concat默认在axis=0上工作,连接多个Series,最终产生一个新的Series
ii) Series连接(axis=1)
如果传入axis=1,则结果变成DataFrame
iii) 连接方式(默认join='outer')
iv) 指定在非连接轴上使用的索引
可以通过join_axes指定要在其他轴(非连接轴的另一条轴)上使用的索引:
v) 区分连接片段
如果想要区分参与连接的片段,或者想要在连接轴上创建一个层次化索引,可以使用keys参数:
如果沿着axis=1对Series进行合并,则keys就会成为DataFrame的列头:
对DataFrame也是一样:
如果传入的不是列表,而是一个字典,则字典的键就会被当作keys选项的值:
还可以继续向names参数传入一个字符串的列表,创建分层级别的名称:
vi) 抛弃无关行索引
传入ignore_index=True抛弃无关索引
如果不传,索引是子df索引的拼接:
传的话,不保留连接轴上的索引,产生一组新索引range(total_length):
一个示例,合并多个csv
- import os
- import pandas as pd
-
- path = r'C:UsersbyqpzDesktoptrade2017解压缩201710'
-
- dfList = []
- os.chdir(path)
- for filename in os.listdir(path):
- if os.path.splitext(filename)[1] == '.csv':
- df = pd.read_csv(filename,engine='python')
- dfList.append(df)
-
- df = pd.concat(dfList,sort=False)
- df.to_csv(path+'result.csv', encoding='utf_8_sig', index=False)
- Combine two DataFrame objects and default to non-null values in frame
- calling the method. Result index columns will be the union of the
- respective indexes and columns
-
-
- combine_first(other)
-
-
- other : DataFrame
combine_first( )方法用参数对象中的数据为调用者对象中的数据“打补丁”
用于把'other'数据的行附加到this frame的末尾,返回一个附加数据后的新对象。不在this frame中的column将作为新列添加进来。
参数如下:
- Append rows of `other` to the end of this frame, returning a new
- object. Columns not in this frame are added as new columns.
-
- append(other, ignore_index=False, verify_integrity=False, sort=None)
-
- Parameters
- ----------
- other : DataFrame or Series/dict-like object, or list of these
- The data to append.
- ignore_index : boolean, default False
- If True, do not use the index labels.
- verify_integrity : boolean, default False
- If True, raise ValueError on creating index with duplicates.
- sort : boolean, default None
- Sort columns if the columns of `self` and `other` are not aligned.
- The default sorting is deprecated and will change to not-sorting
- in a future version of pandas. Explicitly pass ``sort=True`` to
- silence the warning and sort. Explicitly pass ``sort=False`` to
- silence the warning and not sort.
-
- .. versionadded:: 0.23.0
-
- Returns
- -------
- appended : DataFrame
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。