当前位置:   article > 正文

series合并成dataframe_【S01E04】pandas之合并数据集

pandas中怎么将series合并到dataframe中

991a095f7745f3143e702f733185f02d.png
  1. I. 数据库风格的合并——merge
  2. i) 最简单的合并
  3. pd.merge(df1, df2, on='key') key为重叠列名
  4. ii) 连接键列名不同
  5. pd.merge(left, right, left_on='lkey', right_on='rkey')
  6. iii) 连接方式(默认为inner)
  7. pd.merge(left, right, on='key', how='outer')
  8. iv) 连接键为多列
  9. pd.merge(left, right, on=['key1','key2'])
  10. v) 重复列名的处理
  11. pd.merge(left, right, on='key', suffixes=['_left','_right'])
  12. vi) 索引上的合并(索引作为连接键)
  13. pd.merge(left, right, left_on='key', right_index=True)
  14. II. 按索引合并——join
  15. i)join实例方法实现按索引合并
  16. left.join(right, how='outer')
  17. ii)【参数DataFrame的索引】跟【调用者DataFrame的某个列】之间的连接
  18. left.join(right, on='key')
  19. iii)join方法合并多个DataFrame
  20. df1.join([df2,df3], how='outer', sort=True)
  21. III. 轴向连接——concat方法
  22. i) Series连接(axis=0
  23. pd.concat([s1,s2,s3])
  24. ii) Series连接(axis=1
  25. pd.concat([s1,s2,s3], axis=1, sort=True)
  26. iii) 连接方式(默认join='outer'
  27. pd.concat([s1,s2], axis=1, how='inner', sort=True)
  28. iv) 指定在非连接轴上使用的索引
  29. pd.concat([s1,s2], axis=1, sort=True, join_axes=['a','c','b','e'])
  30. v) 区分连接片段
  31. pd.concat([df1,df2], axis=1, sort=True, keys=['left','right'],
  32. names=['level0', 'level1'])
  33. vi) 抛弃无关行索引
  34. pd.concat([df1,df2], sort=True, ignore_index=True)
  35. IV. 合并重叠数据——combine_first()
  36. df1.combine_first(df2)
  37. V. df末尾追加数据——append
  • pandas.merge( )可根据一个或多个键将不同DataFrame中的行连接起来。(类似数据库的连接操作,merge默认做的是"inner"连接,join默认做的是"left"连接)
  • pandas.concat( )可以沿着一条轴将多个对象堆叠到一起。(concat默认做的是"outer"连接)
  • 实例方法combine_first( )可以将重复数据编接在一起,用一个对象中的值填充另一个对象中的值。

I. 数据库风格的合并——merge

  1. Merge DataFrame objects by performing a database-style join operation by
  2. columns or indexes.
  3. merge(left, right, how='inner', on=None, left_on=None, right_on=None,
  4. left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'),
  5. copy=True, indicator=False, validate=None)
  6. left : DataFrame
  7. right : DataFrame
  8. how : {'left', 'right', 'outer', 'inner'}, default 'inner'
  9. * left: use only keys from left frame, similar to a SQL left outer join;
  10. preserve key order
  11. * right: use only keys from right frame, similar to a SQL right outer join;
  12. preserve key order
  13. * outer: use union of keys from both frames, similar to a SQL full outer
  14. join; sort keys lexicographically
  15. * inner: use intersection of keys from both frames, similar to a SQL inner
  16. join; preserve the order of the left keys
  17. on : label or list
  18. Column or index level names to join on. These must be found in both
  19. DataFrames. If `on` is None and not merging on indexes then this defaults
  20. to the intersection of the columns in both DataFrames.
  21. left_on : label or list, or array-like
  22. Column or index level names to join on in the left DataFrame. Can also
  23. be an array or list of arrays of the length of the left DataFrame.
  24. These arrays are treated as if they are columns.
  25. right_on : label or list, or array-like
  26. Column or index level names to join on in the right DataFrame. Can also
  27. be an array or list of arrays of the length of the right DataFrame.
  28. These arrays are treated as if they are columns.
  29. left_index : boolean, default False
  30. Use the index from the left DataFrame as the join key(s). If it is a
  31. MultiIndex, the number of keys in the other DataFrame (either the index
  32. or a number of columns) must match the number of levels
  33. right_index : boolean, default False
  34. Use the index from the right DataFrame as the join key. Same caveats as
  35. left_index
  36. sort : boolean, default False
  37. Sort the join keys lexicographically in the result DataFrame. If False,
  38. the order of the join keys depends on the join type (how keyword)
  39. suffixes : 2-length sequence (tuple, list, ...)
  40. Suffix to apply to overlapping column names in the left and right
  41. side, respectively
  42. copy : boolean, default True
  43. If False, do not copy data unnecessarily
  44. 重点参数
  45. left
  46. right
  47. how
  48. on
  49. left_on
  50. right_on
  51. left_index
  52. right_index
  53. 即 数据、连接方式、连接键

数据集的合并(merge)或连接(join)运算是通过一个或多个键将行连接起来的。这些运算是关系型数据库的核心

i) 最简单的合并

最简单的连接(如果没有显式指定连接键,merge默认将重叠的列名当作键)

c137f8388aa611ed131757ea9ed47487.png

858b386258ca174ae70f29d2feab3aab.png

35cd4c01c15f9d1b7a89ee0a99b8631d.png

最好还是指定连接键:

55d2292823152c12641cd3f5928cc61f.png

ii) 连接键列名不同

如果在左右DataFrame中作为连接键的列有不同的列名,或者说左侧DataFrame中用作连接键的列与右侧DataFrame中用作连接键的列不同,可以用left_on和(或)right_on关键字(分别)显示指定

ecac83ea8d4033ff6e6f77c9a84883b5.png

8dd2dc9683799b197485d1e8431be9ab.png

6729fc98666e9c3eb81f1cf6e4295347.png

iii) 连接方式(默认为inner)

上面结果中没有c、d及与之相关的数据,因为merge默认做的是"inner"连接,结果中的键是交集。如果需要其他连接方式,用how关键字显式指定。

how='outer'

16046486ef4fa932e3ed833891ed0800.png

how='left'

b4093f0591c38146a8cc405541cfd7f0.png

2f50377a733bb403838cc8b45ca409f2.png

099b9b045c83aa9078e19ec98ca9e871.png
多对多的连接,结果是行的笛卡尔积。左侧的DataFrame有3个b行,右边的有2个,所以最终结果中就有6个b行

how默认为'inner'

9861ad9a274b0f1838968c7f7fe4c660.png

iv) 连接键为多列

如果要根据多个键进行合并,传入一个由列名组成的列表即可:

5b651a718730721d997288ef032d4d8d.png

faa115a8446668b50b227af2fe5fc078.png

ea03a9e6cf7c15b76d10ad5ff26953d8.png

v) 重复列名的处理

向suffixes关键字传入一个【附加到重叠列名上的字符串】的元组或列表,可以指定重叠列名的后缀

如果不指定,后缀名默认为"_x"和"_y"

92eb95db7d7d628c7556438aceff4840.png

如果向suffixes传入【重叠列名后缀名字符串的列表】显式指定:

b05a45e9d9db28188b6aa938415370e9.png

vi) 索引上的合并(索引作为连接键)

有时候,DataFrame中的连接键位于其索引上。在这种情况下,可以传入"left_index=True"right_index=True"(或两个都传)以说明索引应该被用作连接键

1bfe5b46266b3b2f7f7dcfc5631fbb4a.png

858df6ec917f26c07198a42be365239c.png

851d268beada04ef07102fea22da54b7.png

如果把层次化索引作为连接键,对应需要用列表的形式指明另一个DataFrame中用作合并键的多个列

fc9383b187a81bdc1275f6a0ebfcbabf.png

fe779b58308d672b111a91b7a625d800.png

d5a124a1d69512e28943df84ffd3044f.png

II. 按索引合并——join

  1. Join columns with other DataFrame either on index or on a key
  2. column. Efficiently Join multiple DataFrame objects by index at once by
  3. passing a list.
  4. join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
  5. other : DataFrame, Series with name field set, or list of DataFrame
  6. Index should be similar to one of the columns in this one. If a
  7. Series is passed, its name attribute must be set, and that will be
  8. used as the column name in the resulting joined DataFrame
  9. on : name, tuple/list of names, or array-like
  10. Column or index level name(s) in the caller to join on the index
  11. in `other`, otherwise joins index-on-index. If multiple
  12. values given, the `other` DataFrame must have a MultiIndex. Can
  13. pass an array as the join key if it is not already contained in
  14. the calling DataFrame. Like an Excel VLOOKUP operation
  15. how : {'left', 'right', 'outer', 'inner'}, default: 'left'
  16. How to handle the operation of the two objects.
  17. * left: use calling frame's index (or column if on is specified)
  18. * right: use other frame's index
  19. * outer: form union of calling frame's index (or column if on is
  20. specified) with other frame's index, and sort it
  21. lexicographically
  22. * inner: form intersection of calling frame's index (or column if
  23. on is specified) with other frame's index, preserving the order
  24. of the calling's one
  25. lsuffix : string
  26. Suffix to use from left frame's overlapping columns
  27. rsuffix : string
  28. Suffix to use from right frame's overlapping columns
  29. sort : boolean, default False
  30. Order result DataFrame lexicographically by the join key. If False,
  31. the order of the join key depends on the join type (how keyword)
  32. 重点参数:
  33. other
  34. on
  35. how
  36. 即数据、连接方式、连接键(默认将索引作为连接键)

i)join实例方法实现按索引合并

DataFrame还有一个join实例方法,它能更为方便地实现按索引合并,可以用于合并多个带有相同或相似索引的DataFrame对象(join默认在连接键上做左连接,使用时最好用how关键字显式指定连接方式)

acc32d64ce3ab7c190e1e4e145dd0465.png

9164b4719aa4664e1ad3f422a51bdc1c.png

ii)【参数DataFrame的索引】跟【调用者DataFrame的某个列】之间的连接

join实例方法还支持【参数DataFrame的索引】跟【调用者DataFrame的某个列】之间的连接:

858df6ec917f26c07198a42be365239c.png

affdf9c1281c2a66d61399b7e3dc8c12.png

iii)join方法合并多个DataFrame

对于简单的索引合并,还可以向join传入一个DataFrame对象的列表,合并多个df

3dcccd552728260ac45b872e887a01a8.png

3b1f8c92393f6540894a6beadcbb1eb8.png

365cd8ae2f7bb3ba3892bb5ed4a36044.png

III. 轴向连接——concat方法

  1. Concatenate pandas objects along a particular axis with optional set logic
  2. along the other axes.
  3. Can also add a layer of hierarchical indexing on the concatenation axis,
  4. which may be useful if the labels are the same (or overlapping) on
  5. the passed axis number.
  6. concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
  7. keys=None, levels=None, names=None, verify_integrity=False,
  8. sort=None, copy=True)
  9. objs : a sequence or mapping of Series, DataFrame, or Panel objects
  10. If a dict is passed, the sorted keys will be used as the `keys`
  11. argument, unless it is passed, in which case the values will be
  12. selected (see below). Any None objects will be dropped silently unless
  13. they are all None in which case a ValueError will be raised
  14. axis : {0/'index', 1/'columns'}, default 0
  15. The axis to concatenate along
  16. join : {'inner', 'outer'}, default 'outer'
  17. How to handle indexes on other axis(es)
  18. join_axes : list of Index objects
  19. Specific indexes to use for the other n - 1 axes instead of performing
  20. inner/outer set logic
  21. ignore_index : boolean, default False
  22. If True, do not use the index values along the concatenation axis. The
  23. resulting axis will be labeled 0, ..., n - 1. This is useful if you are
  24. concatenating objects where the concatenation axis does not have
  25. meaningful indexing information. Note the index values on the other
  26. axes are still respected in the join.
  27. keys : sequence, default None
  28. If multiple levels passed, should contain tuples. Construct
  29. hierarchical index using the passed keys as the outermost level
  30. levels : list of sequences, default None
  31. Specific levels (unique values) to use for constructing a
  32. MultiIndex. Otherwise they will be inferred from the keys
  33. names : list, default None
  34. Names for the levels in the resulting hierarchical index
  35. verify_integrity : boolean, default False
  36. Check whether the new concatenated axis contains duplicates. This can
  37. be very expensive relative to the actual data concatenation
  38. sort : boolean, default None
  39. Sort non-concatenation axis if it is not already aligned when `join`
  40. is 'outer'. The current default of sorting is deprecated and will
  41. change to not-sorting in a future version of pandas.
  42. Explicitly pass ``sort=True`` to silence the warning and sort.
  43. Explicitly pass ``sort=False`` to silence the warning and not sort.
  44. This has no effect when ``join='inner'``, which already preserves
  45. the order of the non-concatenation axis.
  46. .. versionadded:: 0.23.0
  47. copy : boolean, default True
  48. If False, do not copy data unnecessarily
  49. 重点参数:
  50. objs
  51. axis 连接轴
  52. join 连接方式
  53. sort 非连接轴上标签的是否列序

concat的参数join默认为"outer",默认做的是外连接

i) Series连接(axis=0)

1845a727851ccc6d397d94e6672b4946.png

8ec2a8014459c170799646d2dddd1624.png

concat默认在axis=0上工作,连接多个Series,最终产生一个新的Series

efbc91ddb82fd38de2bbc84cde208d9c.png

ii) Series连接(axis=1)

如果传入axis=1,则结果变成DataFrame

a0998800cc7e9c1dcf6f0f366bcac404.png

iii) 连接方式(默认join='outer'

f1ead14957fd6134ee1a71f0704718d5.png

7fce1c0f3600c02b6c63d109406dd9c4.png

52be8a9f69c703ce52b54ac5d27ebcd7.png

4b06b3603f4abbaec20c7e44cc0a1277.png

iv) 指定在非连接轴上使用的索引

可以通过join_axes指定要在其他轴(非连接轴的另一条轴)上使用的索引:

6dcba5962658623aa5f075172796f0dd.png

v) 区分连接片段

如果想要区分参与连接的片段,或者想要在连接轴上创建一个层次化索引,可以使用keys参数:

ed8bbe107febada8f483eba6eb0ad570.png

4185736ac4496e39215a0e493df8d275.png
使用keys参数在连接轴上创建一个层次化索引

如果沿着axis=1对Series进行合并,则keys就会成为DataFrame的列头:

62b2604ea7d19392ed724ea2507aa3d5.png

bb33b922578248e3c3d12ed6b716c2bb.png

对DataFrame也是一样:

507044d3311308fa25a8c80bee137a4a.png

28834d5a400dd8bbdbac3991ecde9cf4.png

62f323773eba7a9c72c0aa4f0c05a891.png

如果传入的不是列表,而是一个字典,则字典的键就会被当作keys选项的值:

4f58f5f86c3da81370b77954c4bfcacb.png

还可以继续向names参数传入一个字符串的列表,创建分层级别的名称:

6f2abed2170798cf349243bb16ba30ad.png

f191b9184512dbd21bd12d59732dd4c8.png

vi) 抛弃无关行索引

传入ignore_index=True抛弃无关索引

33669ec1a1f9d34286acb482464979ac.png

如果不传,索引是子df索引的拼接:

0fbafebab366e07e7a8a7021c4b220b7.png

传的话,不保留连接轴上的索引,产生一组新索引range(total_length)

27fc896b1d93a2e9f7e92f35571e78e8.png
这也是一个常用的关键字

一个示例,合并多个csv

  1. import os
  2. import pandas as pd
  3. path = r'C:UsersbyqpzDesktoptrade2017解压缩201710'
  4. dfList = []
  5. os.chdir(path)
  6. for filename in os.listdir(path):
  7. if os.path.splitext(filename)[1] == '.csv':
  8. df = pd.read_csv(filename,engine='python')
  9. dfList.append(df)
  10. df = pd.concat(dfList,sort=False)
  11. df.to_csv(path+'result.csv', encoding='utf_8_sig', index=False)

IV. 合并重叠数据——combine_first()

  1. Combine two DataFrame objects and default to non-null values in frame
  2. calling the method. Result index columns will be the union of the
  3. respective indexes and columns
  4. combine_first(other)
  5. other : DataFrame

combine_first( )方法用参数对象中的数据为调用者对象中的数据“打补丁”

274be762e45c91cef7b0f2df54b90026.png

4e65c3322b2eadf012fbd145ea1a8485.png

356dfd4100cbd7e63aacc1c205cd71f0.png

52c21edffe7f6c85261c2a74d60e6d2a.png

f8ce0d96f735c1bf5efd782117ba578c.png

915a67d1caf2e22c5aee2f01c31486de.png

302540640648b88a2ecc24aff54df1b1.png

V. df末尾追加数据——append

用于把'other'数据的行附加到this frame的末尾,返回一个附加数据后的新对象。不在this frame中的column将作为新列添加进来。

参数如下:

  1. Append rows of `other` to the end of this frame, returning a new
  2. object. Columns not in this frame are added as new columns.
  3. append(other, ignore_index=False, verify_integrity=False, sort=None)
  4. Parameters
  5. ----------
  6. other : DataFrame or Series/dict-like object, or list of these
  7. The data to append.
  8. ignore_index : boolean, default False
  9. If True, do not use the index labels.
  10. verify_integrity : boolean, default False
  11. If True, raise ValueError on creating index with duplicates.
  12. sort : boolean, default None
  13. Sort columns if the columns of `self` and `other` are not aligned.
  14. The default sorting is deprecated and will change to not-sorting
  15. in a future version of pandas. Explicitly pass ``sort=True`` to
  16. silence the warning and sort. Explicitly pass ``sort=False`` to
  17. silence the warning and not sort.
  18. .. versionadded:: 0.23.0
  19. Returns
  20. -------
  21. appended : DataFrame

dbefd3d443b4fb2e27520fe4613d4872.png

11d401f63f2e5ce6ab6542e2e1069558.png

746c4b93b87c70032163f50ff526ca03.png
声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/小蓝xlanll/article/detail/663931
推荐阅读
相关标签
  

闽ICP备14008679号