
An efficient way to extract a small number of data rows from large CSV data files in Python


I have a large number of CSV data files, and each data file contains several days' worth of tick data for one ticker, in the following form:

ticker DD/MM/YYYY time bid ask

XXX, 19122014, 08:00:08.325, 9929.00,9933.00
XXX, 19122014, 08:00:08.523, 9924.00,9931.00
XXX, 19122014, 08:00:08.722, 9925.00,9930.50
XXX, 19122014, 08:00:08.921, 9924.00,9928.00
XXX, 19122014, 08:00:09.125, 9924.00,9928.00
XXX, 30122014, 21:56:25.181, 9795.50,9796.50
XXX, 30122014, 21:56:26.398, 9795.50,9796.50
XXX, 30122014, 21:56:26.598, 9795.50,9796.50
XXX, 30122014, 21:56:26.798, 9795.50,9796.50
XXX, 30122014, 21:56:28.896, 9795.50,9796.00
XXX, 30122014, 21:56:29.096, 9795.50,9796.50
XXX, 30122014, 21:56:29.296, 9795.50,9796.00

I need to extract any lines of data whose time is within a certain range, say 09:00:00 to 09:15:00. My current solution is simply to read each data file into a DataFrame, sort it by time, and then use searchsorted to find the 09:00:00 to 09:15:00 boundaries. It works fine when performance isn't an issue and there aren't 1000 files waiting to be processed. Any suggestions on how to boost the speed? Thanks in advance for the help!
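For reference, a minimal sketch of the current approach (the file name and column names are illustrative, not from my real data):

import pandas as pd

# Column names are illustrative; the files have one header line, so skip it
cols = ["ticker", "date", "time", "bid", "ask"]
df = pd.read_csv("ticks.csv", names=cols, skiprows=1, skipinitialspace=True)

# Zero-padded HH:MM:SS.mmm strings sort chronologically as plain strings
df = df.sort_values("time").reset_index(drop=True)
times = df["time"].values
start = times.searchsorted("09:00:00.000")
end = times.searchsorted("09:15:00.000", side="right")
window = df.iloc[start:end]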

Solution

Short answer: put your data in an SQL database and give the "time" column an index. You can't beat that with CSV files, with or without Pandas.
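A minimal sketch of that idea, using the standard-library sqlite3 module (the file, table, and column names here are assumptions for illustration):

import csv
import sqlite3

conn = sqlite3.connect("ticks.db")
conn.execute("""CREATE TABLE IF NOT EXISTS ticks
                (ticker TEXT, date TEXT, time TEXT, bid REAL, ask REAL)""")

# Load one CSV file; repeat for each of the ~1000 files
with open("ticks.csv", newline="") as f:
    reader = csv.reader(f, skipinitialspace=True)
    next(reader)  # skip the header line
    conn.executemany("INSERT INTO ticks VALUES (?, ?, ?, ?, ?)", reader)

# The index on "time" is what makes the range query fast
conn.execute("CREATE INDEX IF NOT EXISTS idx_time ON ticks(time)")
conn.commit()

rows = conn.execute("SELECT * FROM ticks WHERE time BETWEEN ? AND ?",
                    ("09:00:00.000", "09:15:00.000")).fetchall()

After the one-time load, each time-range query is an indexed lookup instead of a scan over 1000 files.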

Without changing your CSV files, one thing that would be a little faster, though not by much, is to filter the rows as you read them, keeping in memory only the ones that interest you.

So instead of reading the whole CSV into memory, a function like this could do the job:

import csv

def filter_time(filename, mintime, maxtime):
    # Zero-based index of the time field (ticker, date, time, bid, ask)
    timecol = 2
    with open(filename, newline="") as f:
        # skipinitialspace handles the blank after each comma in the data
        reader = csv.reader(f, skipinitialspace=True)
        next(reader)  # skip the header line
        # Zero-padded HH:MM:SS.mmm strings compare correctly as plain strings
        return [line for line in reader if mintime <= line[timecol] <= maxtime]
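It is called once per file, with the boundaries given as strings in the same zero-padded format the files use (the file name here is just an example):

rows = filter_time("ticks_XXX.csv", "09:00:00.000", "09:15:00.000")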

This task can be easily parallelized - you could probably run several instances of this concurrently before maxing out the I/O on your device, I'd guess. One painless way to do that would be the lelo Python package - it provides a @parallel decorator that makes the decorated function run in another process when called and return a lazy proxy for the results.
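If you prefer to stay in the standard library rather than add a dependency, concurrent.futures gives a similar effect; a sketch reusing the filter_time function above (the glob pattern is an assumption about how your files are named):

import glob
from concurrent.futures import ProcessPoolExecutor

def filter_one(filename):
    # Fixed window per file; reuses filter_time defined above
    return filter_time(filename, "09:00:00.000", "09:15:00.000")

if __name__ == "__main__":
    files = glob.glob("ticks_*.csv")
    with ProcessPoolExecutor() as pool:
        # One worker process per CPU core by default; each filters one file
        results = list(pool.map(filter_one, files))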

But that will still have to read everything in - I'd expect the SQL solution to be at least an order of magnitude faster.
