
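Speeding up a for loop with Python multiprocessing: sharing a large pandas DataFrame with a multiprocessing for loop in Python

Running multiple Python processes on the same DataFrame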

Using Python 2.7 on a Windows machine, I have a large pandas DataFrame (about 7 million rows and 20+ columns) from a SQL query that I'd like to filter by looping through IDs then run calculations on the resulting filtered data. I'd also like to do this in parallel.

I know that if I try to do this with standard methods from the multiprocessing package in Windows, each process will generate a new instance of that large DataFrame for its own use and my memory will be eaten up. So I'm trying to use information I've read on remote managers to make my DataFrame a proxy object and share that across each process but I'm struggling to make it work.

My code is below, and I can get it to work on a single for loop no problem, but again the memory gets eaten up if I make it a parallel process:

    import multiprocessing

    import pandas
    import pyodbc


    def download(args):
        """pyodbc code to download data from sql database"""


    def calc(dataset, index):
        # Keep only the rows that belong to this ID
        filter_data = dataset[dataset['ID'] == index]
        """run calculations on filtered DataFrame"""
        """append results to local csv"""


    if __name__ == '__main__':
        data_1 = download(args_1)
        data_2 = download(args_2)
        all_data = data_1.append(data_2)  # Append downloaded DataFrames into one
        unique_id = pandas.unique(all_data['ID'])
        pool = multiprocessing.Pool()
        # all_data is passed to (and so pickled for) a worker once per ID
        [pool.apply_async(calc, args=(all_data, x)) for x in unique_id]

Solution

Q : "Sharing large pandas DataFrame with multiprocessing for loop in Python ?"

While the multiprocessing module does have tools for sharing some data, using them here would be an anti-pattern, given that the stated goal is better performance from running the work inside a Pool-instance in a "just"-[CONCURRENT] fashion.

Why?

You spend immense costs on moving the filtering into a Pool of independent ( "just"-[CONCURRENT] ) workers, yet each of them then waits to be served through the same central GIL-lock, which turns the Manager's work back into pure-[SERIAL] processing. Even worse, because the work is RAM I/O-bound, having no free access to RAM chokes performance further, which takes you in exactly the wrong direction.
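To make one of these add-on costs concrete, here is a minimal sketch that uses only the standard library. It does not come from the original answer. A list of `(ID, value)` tuples stands in for the DataFrame, and the helper names (`make_rows`, `bytes_if_whole_dataset_per_task`, `bytes_if_one_group_per_task`) are hypothetical. It compares how many bytes have to be pickled in total when every task receives the whole dataset, as in `apply_async(calc, args=(all_data, x))`, with how many are pickled when each task receives only the rows for its own ID.

```python
import pickle
from collections import defaultdict


def make_rows(n_rows, n_ids):
    """Build a small stand-in for the DataFrame: a list of (ID, value) tuples."""
    return [(i % n_ids, float(i)) for i in range(n_rows)]


def bytes_if_whole_dataset_per_task(rows, n_ids):
    """Total bytes pickled when each of n_ids tasks gets (rows, some_id)."""
    per_task = len(pickle.dumps((rows, 0)))
    return per_task * n_ids


def bytes_if_one_group_per_task(rows):
    """Total bytes pickled when each task gets only the rows for its own ID."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return sum(len(pickle.dumps(group)) for group in groups.values())


rows = make_rows(n_rows=10000, n_ids=100)
whole = bytes_if_whole_dataset_per_task(rows, n_ids=100)
grouped = bytes_if_one_group_per_task(rows)
print("whole dataset per task: %d bytes in total" % whole)
print("one group per task:     %d bytes in total" % grouped)
```

The exact figures depend on the pickle protocol in use, but the first total grows with the number of IDs multiplied by the size of the dataset, while the second stays close to the size of the dataset itself. Whether shipping pre-filtered groups pays off for the real 7-million-row DataFrame would still need to be measured.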

THE ECONOMY OF ADD-ON COSTS vs. THE TRAP OF AMDAHL'S LAW :

The rate at which money gets burnt ( the add-on costs ), which is not visible from a few SLOC-s, can be ( and often is ) far higher than any in-vivo performance benefit ( which stays only potential until it has been well engineered, tuned and validated ) from running several streams of code-execution in a "just"-[CONCURRENT] fashion ( and this is harder still for a True-[PARALLEL] one ).
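The trade-off can be sketched numerically. The classic Amdahl's Law gives the speedup as 1 / ((1 - p) + p / N) for a parallelisable fraction p and N workers. The version below adds an `overhead` term for the add-on costs, expressed as a fraction of the original single-process runtime. That extra term is a simplifying assumption made here for illustration, not a formula taken from the answer, and the function name `speedup` is hypothetical.

```python
def speedup(p, n_workers, overhead=0.0):
    """Amdahl-style speedup estimate with an extra add-on cost term.

    p         -- fraction of the original runtime that can run in parallel
    n_workers -- number of worker processes
    overhead  -- add-on costs (process start-up, data transfer, ...) as a
                 fraction of the original runtime (illustrative assumption)
    """
    return 1.0 / ((1.0 - p) + p / n_workers + overhead)


# Same workload (90 % parallelisable, 4 workers), growing add-on costs
for overhead in (0.0, 0.1, 0.5, 1.0):
    print("overhead=%.1f -> speedup=%.2f" % (overhead, speedup(0.9, 4, overhead)))
```

In this model, once the add-on costs reach the size of the original runtime (`overhead=1.0`), the estimated speedup falls below 1, so the "parallel" version is slower than the plain loop.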
