On a project I had to build models with pandas on client-supplied CSV files ranging from 20 MB to 10 GB. Loading them was slow, and reading a whole file sometimes exhausted memory. I tested several speed-up techniques mentioned online and summarize the results here (I did not record every reference at the time; if anything infringes, contact me for removal).
Test environment: Windows 10, pandas 1.5.3, 12th Gen Intel® Core™ i5-12600KF @ 3.70 GHz, 16 GB RAM
import pandas as pd
import numpy as np  # used later for the dtype mapping
import time
from sys import getsizeof

time_start = time.time()
data = pd.read_csv("../data/input/test_data.csv", encoding="gbk", engine="python")
time_end = time.time()
print("Elapsed: {} s".format(time_end - time_start))
print("Data size: {} kB".format(round(getsizeof(data) / 1024, 2)))
Output:
Elapsed: 27.383215188980103 s
Data size: 1591406.19 kB
time_start = time.time()
data = pd.read_csv("../data/input/test_data.csv", encoding="gbk", engine="c")
time_end = time.time()
print("Elapsed: {} s".format(time_end - time_start))
print("Data size: {} kB".format(round(getsizeof(data) / 1024, 2)))
Output:
Elapsed: 5.371628665924072 s
Data size: 1591406.19 kB
The C engine in read_csv is roughly 5× faster than the Python engine!
pandas defaults to int64 for integers and float64 for floats. For many datasets we can specify narrower dtypes at read time, which both speeds up loading and reduces memory (check that the values actually fit the narrower type first, or they will overflow or lose precision).
data = pd.read_csv("../data/input/test_data.csv", encoding="gbk", engine="c")
print(data.dtypes)
Output:
测试1 object
测试2 object
测试3 int64
测试4 object
测试5 float64
测试6 int64
测试7 float64
测试8 float64
dtype: object
Setting the dtype parameter in read_csv:
time_start = time.time()
data = pd.read_csv("../data/input/test_data.csv", encoding="gbk", engine="c",
                   dtype={"测试3": np.int8, "测试5": np.float16, "测试6": np.int8,
                          "测试7": np.float16, "测试8": np.float16})
time_end = time.time()
print(data.dtypes)
print("Elapsed: {} s".format(time_end - time_start))
print("Data size: {} kB".format(round(getsizeof(data) / 1024, 2)))
Output:
测试1 object
测试2 object
测试3 int8
测试4 object
测试5 float16
测试6 int8
测试7 float16
测试8 float16
dtype: object
Elapsed: 5.005128908157349 s
Data size: 1441526.0 kB
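Downcasting also works after a file has already been loaded. A minimal sketch on a tiny in-memory CSV (the column names here are made up, not the 测试 columns above): pd.to_numeric picks the smallest dtype that holds the values.

```python
import io

import numpy as np
import pandas as pd

# Tiny in-memory CSV standing in for the large file; column names are hypothetical.
csv_text = "a,b\n1,0.5\n2,1.5\n3,2.5\n"

# Default inference gives int64 / float64.
df = pd.read_csv(io.StringIO(csv_text))

# Downcast after loading; check value ranges first so nothing overflows.
df["a"] = pd.to_numeric(df["a"], downcast="integer")  # -> int8 here
df["b"] = pd.to_numeric(df["b"], downcast="float")    # -> float32

print(df.dtypes)
print("bytes:", df.memory_usage(deep=True).sum())
```

The same trade-off as the dtype parameter applies: float16/float32 lose precision relative to float64, which may or may not matter for modelling.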
Often not all of the data in a file is needed, so we can load only what we need, which both speeds up loading and reduces memory. When the file is very large, use the chunksize parameter of read_csv to read a small slice first and inspect the column names, then load selectively with the usecols parameter.
data = pd.read_csv("../data/input/test-data.csv", encoding="gbk", engine="c", chunksize=30)
for part_data in data:
    print(part_data.columns)
    break
Output:
Index(['测试1', '测试2', '测试3', '测试4', '测试5', '测试6', '测试7', '测试8'], dtype='object')
With the column names known, load only the columns you need:
time_start = time.time()
data = pd.read_csv("../data/input/test-data.csv", encoding="gbk", engine="c", usecols=["测试1"])
time_end = time.time()
print("Elapsed: {} s".format(time_end - time_start))
print("Data size: {} kB".format(round(getsizeof(data) / 1024, 2)))
Output:
Elapsed: 3.5131638050079346 s
Data size: 355965.88 kB
Both load time and memory footprint drop substantially.
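chunksize is also useful beyond peeking at column names: aggregations can be computed chunk by chunk so the full file is never resident in memory at once. A minimal sketch on an in-memory CSV (the column name "x" is hypothetical):

```python
import io

import pandas as pd

# Stand-in for the big file: 100 rows with one numeric column "x".
csv_text = "x\n" + "\n".join(str(i) for i in range(100)) + "\n"

# Aggregate chunk by chunk; at most chunksize rows are in memory at a time.
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=30):
    total += chunk["x"].sum()

print(total)  # 0 + 1 + ... + 99 = 4950
```

This pattern works for any aggregation that can be combined across chunks (sums, counts, min/max); it does not directly fit operations that need all rows at once, such as a global sort.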
pip install polars
import time
from sys import getsizeof

import polars as pl

time_start = time.time()
data = pl.read_csv("../data/input/test-data.csv", encoding="utf8-lossy")
end_time = time.time()
print("Elapsed: {} s".format(end_time - time_start))
print(type(data))
print(data.dtypes)
print(data)
print("data object: {} kB".format(round(getsizeof(data) / 1024, 2)))

pandas_df = data.to_pandas()
memory_usage_bytes = pandas_df.memory_usage(deep=True).sum()
memory_usage_kb = round(memory_usage_bytes / 1024, 2)
print("Data size: {} kB".format(memory_usage_kb))
Output (the column headers show � because the GBK-encoded names were decoded with "utf8-lossy"):
Elapsed: 0.07973313331604004 s
<class 'polars.dataframe.frame.DataFrame'>
[Int64, Utf8, Utf8, Int64, Utf8, Float64, Int64, Float64, Float64]
shape: (186_958, 9)
┌────────┬─────────────────────┬─────────────────────┬───────┬───┬────────┬───────┬───────┬───────┐
│        ┆ ����1               ┆ ����2               ┆ ����3 ┆ … ┆ ����5  ┆ ����6 ┆ ����7 ┆ ����8 │
│ ---    ┆ ---                 ┆ ---                 ┆ ---   ┆   ┆ ---    ┆ ---   ┆ ---   ┆ ---   │
│ i64    ┆ str                 ┆ str                 ┆ i64   ┆   ┆ f64    ┆ i64   ┆ f64   ┆ f64   │
╞════════╪═════════════════════╪═════════════════════╪═══════╪═══╪════════╪═══════╪═══════╪═══════╡
│ 0      ┆ 2023-03-09 00:00:00 ┆ 2023-03-09 01:00:00 ┆ 205   ┆ … ┆ 1.1236 ┆ 3     ┆ 0.01  ┆ 0.23  │
│ 1      ┆ 2023-03-09 00:00:00 ┆ 2023-03-09 01:00:00 ┆ 201   ┆ … ┆ 0.0    ┆ 0     ┆ 0.0   ┆ 0.0   │
│ 2      ┆ 2023-03-09 00:00:00 ┆ 2023-03-09 01:00:00 ┆ 202   ┆ … ┆ 0.0    ┆ 0     ┆ 0.0   ┆ 0.0   │
│ 3      ┆ 2023-03-09 00:00:00 ┆ 2023-03-09 01:00:00 ┆ 203   ┆ … ┆ 0.1947 ┆ 2     ┆ 0.0   ┆ 0.01  │
│ …      ┆ …                   ┆ …                   ┆ …     ┆ … ┆ …      ┆ …     ┆ …     ┆ …     │
│ 186954 ┆ 2023-06-05 22:00:00 ┆ 2023-06-05 23:00:00 ┆ 19    ┆ … ┆ 0.1172 ┆ 4     ┆ 0.0   ┆ 0.01  │
│ 186955 ┆ 2023-06-05 22:00:00 ┆ 2023-06-05 23:00:00 ┆ 20    ┆ … ┆ 0.4344 ┆ 18    ┆ 0.0   ┆ 0.04  │
│ 186956 ┆ 2023-06-05 22:00:00 ┆ 2023-06-05 23:00:00 ┆ 21    ┆ … ┆ 1.0528 ┆ 21    ┆ 0.01  ┆ 0.07  │
│ 186957 ┆ 2023-06-05 22:00:00 ┆ 2023-06-05 23:00:00 ┆ 22    ┆ … ┆ 0.6484 ┆ 17    ┆ 0.01  ┆ 0.16  │
└────────┴─────────────────────┴─────────────────────┴───────┴───┴────────┴───────┴───────┴───────┘
data object: 0.05 kB
Data size: 67628.84 kB
On the same file, polars read the data nearly 50× faster than pandas' read_csv!
Reading data directly with polars sometimes raises:
exceptions.ComputeError: invalid utf-8 sequence in csv
The workaround is to set the encoding parameter to "utf8-lossy" when reading, so that bytes that cannot be decoded are replaced with �:
data = pl.read_csv("../data/input/test-data.csv", encoding="utf8-lossy")
pip install datatable
import time
import datatable as dt
from sys import getsizeof

time_start = time.time()
data = dt.fread("../data/input/test-data.csv", encoding="gbk")
end_time = time.time()
print("Elapsed: {} s".format(end_time - time_start))
print("datatable version:", dt.__version__)
print(type(data))
print("Data size: {} kB".format(round(getsizeof(data) / 1024, 2)))
Output:
Elapsed: 0.11500072479248047 s
datatable version: 1.0.0
<class 'datatable.Frame'>
Data size: 22585.8 kB
datatable's read speed is in the same ballpark as polars.
Installing datatable failed with the error below. The datatable docs (https://datatable.readthedocs.io/en/latest/releases/v1.0.0.html) show that datatable 1.0.0 supports Python 3.6 through 3.9, while I was on Python 3.10; downgrading Python made the install succeed. If you hit the same error, check the docs for the Python versions the latest datatable supports.
Collecting datatable
  Using cached datatable-1.0.0.tar.gz (1.1 MB)
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [22 lines of output]
      Traceback (most recent call last):
        File "D:\anconda\lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 353, in <module>
          main()
        File "D:\anconda\lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "D:\anconda\lib\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 152, in prepare_metadata_for_build_wheel
          whl_basename = backend.build_wheel(metadata_directory, config_settings)
        File "C:\Users\Administrator\AppData\Local\Temp\pip-install-r7zihkkb\datatable_259142d33cfc4c0cb26d395c8dfed6fd\ci\ext.py", line 573, in build_wheel
          generate_build_info(flavor, strict=not is_source_distribution())
        File "C:\Users\Administrator\AppData\Local\Temp\pip-install-r7zihkkb\datatable_259142d33cfc4c0cb26d395c8dfed6fd\ci\ext.py", line 467, in generate_build_info
          git_hash = shell_cmd(["git", "rev-parse", "HEAD"], strict=strict)
        File "C:\Users\Administrator\AppData\Local\Temp\pip-install-r7zihkkb\datatable_259142d33cfc4c0cb26d395c8dfed6fd\ci\ext.py", line 437, in shell_cmd
          return subprocess.check_output(cmd, universal_newlines=True,
        File "D:\anconda\lib\subprocess.py", line 421, in check_output
          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
        File "D:\anconda\lib\subprocess.py", line 503, in run
          with Popen(*popenargs, **kwargs) as process:
        File "D:\anconda\lib\subprocess.py", line 971, in __init__
          self._execute_child(args, executable, preexec_fn, close_fds,
        File "D:\anconda\lib\subprocess.py", line 1440, in _execute_child
          hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
      FileNotFoundError: [WinError 2] 系统找不到指定的文件。 (The system cannot find the file specified.)
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
The pandas documentation itself recommends dask, a library that processes large data in parallel and can handle datasets quickly; see the pandas and dask documentation for more detailed examples and usage.
pip install dask
import time
import dask
import dask.dataframe as dd
from sys import getsizeof

time_start = time.time()
data = dd.read_csv("../data/input/test-data.csv", encoding="gbk")
end_time = time.time()
print("Elapsed: {} s".format(end_time - time_start))
print("dask version:", dask.__version__)
print(type(data))
print("data object: {} kB".format(round(getsizeof(data) / 1024, 2)))
print("Data size: {} kB".format(round(getsizeof(data.head(len(data) - 1)) / 1024, 2)))
print(data)
Output:
Elapsed: 0.005980253219604492 s
dask version: 2023.6.0
<class 'dask.dataframe.core.DataFrame'>
data object: 0.05 kB
Data size: 64114.47 kB
Dask DataFrame Structure:
Unnamed: 0 测试1 测试2 测试3 测试4 测试5 测试6 测试7 测试8
npartitions=1
int64 object object int64 object float64 int64 float64 float64
... ... ... ... ... ... ... ... ...
Dask Name: read-csv, 1 graph layer
The timed call returns almost instantly, nearly 1000× faster than read_csv and 20× faster than datatable and polars. Note, though, that dd.read_csv is lazy: it only builds a task graph, and the file is actually read when a result is computed (for example via .head() or .compute()).
1. If you cannot control the format the data arrives in and need to model a large CSV in Python, third-party libraries such as dask, polars, and datatable can help, and tuning the relevant read_csv parameters (engine, dtype, usecols, chunksize) also speeds up loading in pandas itself.
2. If you do control the storage format, consider pickle or HDF5. I strongly advise against xlsx: in my tests it was extremely slow to read.
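As a sketch of point 2, pickle round-trips a DataFrame in binary form, preserving dtypes and skipping text parsing entirely (the temp path below is made up for the example):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": [0.5] * 1000})

# Binary round-trip: dtypes survive and no CSV parsing happens on reload,
# which is why pickle/HDF5 load much faster than CSV (and far faster than xlsx).
path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
df.to_pickle(path)
restored = pd.read_pickle(path)

print(restored.equals(df))  # True
```

One caveat: pickle files are Python-specific and tied to the library versions that wrote them, so they suit intermediate caches rather than long-term or cross-language storage.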