Add one row to pandas DataFrame


Translated from: Add one row to pandas DataFrame

I understand that pandas is designed to load a fully populated DataFrame, but I need to create an empty DataFrame and then add rows one by one. What is the best way to do this?

I successfully created an empty DataFrame with:

res = DataFrame(columns=('lib', 'qty1', 'qty2'))

Then I can add a new row and fill a field with:

res = res.set_value(len(res), 'qty1', 10.0)

It works, but it seems very odd :-/ (and it fails when adding a string value).

How can I add a new row to my DataFrame (with columns of different types)?




You could use pandas.concat() or DataFrame.append(). For details and examples, see Merge, join, and concatenate. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions only pandas.concat() applies.
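A minimal sketch of the pandas.concat() route, assuming a small hypothetical frame that reuses the column names from the question (the values are made up for illustration). The new row is wrapped in a one-row DataFrame and concatenated, with ignore_index=True so the result gets a fresh 0..n-1 index:

```python
import pandas as pd

# Hypothetical starting data; only the column names come from the question.
df = pd.DataFrame({"lib": ["a"], "qty1": [1.0], "qty2": [2.0]})

# Wrap the new row in a one-row DataFrame, then concatenate the two frames.
new_row = pd.DataFrame([{"lib": "b", "qty1": 10.0, "qty2": 20.0}])
df = pd.concat([df, new_row], ignore_index=True)

print(df)
```

Each call to concat copies the data into a new frame, so this is best kept for occasional additions rather than tight loops.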


If you can get all the data for the data frame upfront, there is a much faster approach than appending to a data frame:

  1. Create a list of dictionaries in which each dictionary corresponds to an input data row.
  2. Create a data frame from this list.

I had a similar task for which appending to a data frame row by row took 30 minutes, whereas creating a data frame from a list of dictionaries completed within seconds.

    rows_list = []
    for row in input_rows:
        dict1 = {}
        # get input row in dictionary format
        # key = col_name
        dict1.update(blah..)
        rows_list.append(dict1)

    df = pd.DataFrame(rows_list)
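The snippet above leaves the row-to-dictionary step elided. As a runnable sketch of the same pattern, assume a hypothetical input_rows list of (lib, qty1, qty2) tuples; the tuple layout and the values are invented for illustration and are not part of the original answer:

```python
import pandas as pd

# Hypothetical input: each tuple is one raw record (lib, qty1, qty2).
input_rows = [("name0", 3, 3.5), ("name1", 2, 4.0), ("name2", 2, 8.25)]

rows_list = []
for lib, qty1, qty2 in input_rows:
    # Build one dictionary per row, keyed by column name.
    rows_list.append({"lib": lib, "qty1": qty1, "qty2": qty2})

# Build the DataFrame once, after all rows have been collected.
df = pd.DataFrame(rows_list)
print(df.dtypes)
```

Because the frame is constructed in a single call, pandas can infer a proper dtype for each column (integer for qty1, float for qty2) instead of re-evaluating it on every added row.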


For efficient appending, see How to add an extra row to a pandas dataframe and Setting With Enlargement.

Add rows by assigning through loc (or ix, in old pandas versions) to an index key that does not exist yet. For example:

    In [1]: se = pd.Series([1,2,3])

    In [2]: se
    Out[2]:
    0    1
    1    2
    2    3
    dtype: int64

    In [3]: se[5] = 5.

    In [4]: se
    Out[4]:
    0    1.0
    1    2.0
    2    3.0
    5    5.0
    dtype: float64

Or:

    In [1]: dfi = pd.DataFrame(np.arange(6).reshape(3,2),
       .....:                    columns=['A','B'])
       .....:

    In [2]: dfi
    Out[2]:
       A  B
    0  0  1
    1  2  3
    2  4  5

    In [3]: dfi.loc[:,'C'] = dfi.loc[:,'A']

    In [4]: dfi
    Out[4]:
       A  B  C
    0  0  1  0
    1  2  3  2
    2  4  5  4

    In [5]: dfi.loc[3] = 5

    In [6]: dfi
    Out[6]:
       A  B  C
    0  0  1  0
    1  2  3  2
    2  4  5  4
    3  5  5  5


    >>> import pandas as pd
    >>> from numpy.random import randint
    >>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
    >>> for i in range(5):
    ...     df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))
    ...
    >>> df
         lib qty1 qty2
    0  name0    3    3
    1  name1    2    4
    2  name2    2    8
    3  name3    2    1
    4  name4    9    6
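When there is no loop counter at hand, a common variant is to use len(df) as the new label. The sketch below assumes the index is the default 0..n-1 range, which is what makes len(df) the next unused label; the helper name add_row and the sample values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame(columns=["lib", "qty1", "qty2"])

def add_row(frame, values):
    # Assumes a default 0..n-1 index, so len(frame) is an unused label.
    # With any other index, this could overwrite an existing row instead.
    frame.loc[len(frame)] = values

add_row(df, ["name0", 3, 4.5])
add_row(df, ["name1", 2, 1.0])
add_row(df, ["name2", 7, 0.5])

print(df)
```

Strings and numbers can be mixed in one row this way, which covers the case where set_value failed in the question.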


If you know the number of entries ex ante, you should preallocate the space by also providing the index (taking the data example from a different answer):

    import pandas as pd
    import numpy as np
    # we know we're gonna have 5 rows of data
    numberOfRows = 5
    # create dataframe
    df = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('lib', 'qty1', 'qty2'))

    # now fill it up row by row
    for x in np.arange(0, numberOfRows):
        # loc or iloc both work here since the index is natural numbers
        df.loc[x] = [np.random.randint(-1,1) for n in range(3)]

    In[23]: df
    Out[23]:
       lib  qty1  qty2
    0   -1    -1    -1
    1    0     0     0
    2   -1     0    -1
    3    0    -1     0
    4   -1     0     0

Speed comparison

    In[30]: %timeit tryThis() # function wrapper for this answer
    In[31]: %timeit tryOther() # function wrapper without index (see, for example, @fred)
    1000 loops, best of 3: 1.23 ms per loop
    100 loops, best of 3: 2.31 ms per loop

And, as noted in the comments, with a size of 6000 the speed difference becomes even larger:

Increasing the size of the array (12) and the number of rows (500) makes the speed difference more striking: 313 ms vs 2.29 s
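To reproduce this kind of comparison outside IPython, a rough sketch with time.perf_counter could look like the following. The function names, the row contents, and the row count n are all assumptions made for illustration; they are not the tryThis and tryOther wrappers from the answer, and absolute timings will vary by machine and pandas version:

```python
import time

import numpy as np
import pandas as pd

COLUMNS = ["lib", "qty1", "qty2"]

def grow_with_loc(n):
    # Enlarge an initially empty frame one row at a time.
    df = pd.DataFrame(columns=COLUMNS)
    for i in range(n):
        df.loc[i] = [i, i + 1, i + 2]
    return df

def fill_preallocated(n):
    # Provide the index up front, then overwrite the existing rows.
    df = pd.DataFrame(index=np.arange(n), columns=COLUMNS)
    for i in range(n):
        df.loc[i] = [i, i + 1, i + 2]
    return df

def build_from_dicts(n):
    # Collect plain dictionaries first and construct the frame once.
    rows = [{"lib": i, "qty1": i + 1, "qty2": i + 2} for i in range(n)]
    return pd.DataFrame(rows)

n = 300  # hypothetical size, kept small so the script finishes quickly
results = {}
for func in (grow_with_loc, fill_preallocated, build_from_dicts):
    start = time.perf_counter()
    results[func.__name__] = func(n)
    elapsed = time.perf_counter() - start
    print(f"{func.__name__}: {elapsed * 1000:.1f} ms")
```

All three functions produce frames with the same shape and values, so any difference in the printed timings comes from how the rows are added rather than from the data.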

