I have a large number of CSV data files, and each data file contains several days' worth of tick data for one ticker in the following form:
ticker DD/MM/YYYY time bid ask
XXX, 19122014, 08:00:08.325, 9929.00,9933.00
XXX, 19122014, 08:00:08.523, 9924.00,9931.00
XXX, 19122014, 08:00:08.722, 9925.00,9930.50
XXX, 19122014, 08:00:08.921, 9924.00,9928.00
XXX, 19122014, 08:00:09.125, 9924.00,9928.00
…
XXX, 30122014, 21:56:25.181, 9795.50,9796.50
XXX, 30122014, 21:56:26.398, 9795.50,9796.50
XXX, 30122014, 21:56:26.598, 9795.50,9796.50
XXX, 30122014, 21:56:26.798, 9795.50,9796.50
XXX, 30122014, 21:56:28.896, 9795.50,9796.00
XXX, 30122014, 21:56:29.096, 9795.50,9796.50
XXX, 30122014, 21:56:29.296, 9795.50,9796.00
…
I need to extract any lines of data whose time falls within a certain range, say 09:00:00 to 09:15:00. My current solution is simply reading each data file into a data frame, sorting it by time and then using searchsorted to find 09:00:00 to 09:15:00. It works fine when performance isn't an issue, but I have 1000 files waiting to be processed. Any suggestions on how to boost the speed? Thanks in advance for any help!
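For reference, here is a minimal sketch of that current approach (the file name and column names are assumptions):

import pandas as pd

# read one tick file; the sample files appear to carry a header row
df = pd.read_csv(
    "XXX_ticks.csv",
    header=0,
    names=["ticker", "date", "time", "bid", "ask"],
    skipinitialspace=True,
)
# sort by time, then binary-search the window boundaries
df = df.sort_values("time").reset_index(drop=True)
lo = df["time"].searchsorted("09:00:00")
hi = df["time"].searchsorted("09:15:00", side="right")
window = df.iloc[lo:hi]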
Solution
Short answer: put your data in an SQL database, and give the "time" column an index. You can't beat that with CSV files, whether you use Pandas or not.
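For example, a minimal sketch using SQLite from the standard library (the database, table, and column names here are assumptions, not part of this answer):

import csv
import sqlite3

conn = sqlite3.connect("ticks.db")
conn.execute("""CREATE TABLE IF NOT EXISTS ticks
                (ticker TEXT, date TEXT, time TEXT, bid REAL, ask REAL)""")

# load one CSV file; the reader yields 5-field rows matching the table
with open("XXX_ticks.csv", newline="") as f:
    reader = csv.reader(f, skipinitialspace=True)
    next(reader)  # skip the header line
    conn.executemany("INSERT INTO ticks VALUES (?, ?, ?, ?, ?)", reader)

# the index on "time" is what makes the range lookup fast
conn.execute("CREATE INDEX IF NOT EXISTS idx_ticks_time ON ticks(time)")
conn.commit()

rows = conn.execute(
    "SELECT * FROM ticks WHERE time BETWEEN ? AND ?",
    ("09:00:00", "09:15:00"),
).fetchall()

Once the files are loaded, each time-range query becomes an indexed lookup instead of a full scan over every file.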
Without changing your CSV files, one thing that would be a little faster, though not by much, would be to filter the rows as you read them, keeping in memory only the ones that interest you.
So instead of reading the whole CSV into memory, a function like this could do the job:
import csv

def filter_time(filename, mintime, maxtime):
    timecol = 2  # time is the third field: ticker, date, time, bid, ask
    with open(filename, newline="") as f:
        # skipinitialspace drops the blank after each comma
        reader = csv.reader(f, skipinitialspace=True)
        next(reader)  # skip the header line
        # zero-padded HH:MM:SS.fff strings compare correctly as text
        return [line for line in reader if mintime <= line[timecol] <= maxtime]
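Called like so, with zero-padded time strings so the string comparison is valid (the file name is hypothetical):

rows = filter_time("XXX_ticks.csv", "09:00:00", "09:15:00")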
This task can be easily parallelized - you could probably run a few instances of this concurrently before maxing out the I/O on your device, I'd guess. One painless way to do that would be the lelo Python package - it just provides you a @parallel decorator that makes the given function run in another process when called, and returns a lazy proxy for the results.
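If you prefer to stay in the standard library, here is a sketch of the same idea using concurrent.futures instead of lelo (the function and variable names are assumptions):

from concurrent.futures import ProcessPoolExecutor
from functools import partial

def filter_many(filenames, mintime, maxtime):
    # freeze the time bounds so each worker only needs a file name
    job = partial(filter_time, mintime=mintime, maxtime=maxtime)
    # run filter_time over the files in separate worker processes
    with ProcessPoolExecutor() as pool:
        return dict(zip(filenames, pool.map(job, filenames)))

Unlike lelo's lazy proxies, pool.map blocks until all results are in, which is usually fine for a batch job over 1000 files.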
But that will still have to read everything in - I think the SQL solution should be at least one order of magnitude faster.