Translator: 季一帆
Source: AI研习社
Pandas
Dask
Datatable
Rapids
csv
feather
hdf5
jay
parquet
pickle
import pandas as pd
import dask.dataframe as dd

# confirming the default pandas doesn't work (running the below code should result in a memory error)
# data = pd.read_csv("../input/riiid-test-answer-prediction/train.csv")
%%time
dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean"
}
data = pd.read_csv("../input/riiid-test-answer-prediction/train.csv", dtype=dtypes)
print("Train size:", data.shape)

Train size: (101230332, 10)
CPU times: user 8min 11s, sys: 10.8 s, total: 8min 22s
Wall time: 8min 22s

data.head()
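The dtypes dictionary is what lets pandas hold the full table in memory. As a rough illustration, the sketch below parses the same CSV text twice, once with the default dtypes and once with narrow ones, and compares the memory footprint. The data is synthetic and much smaller than train.csv; only the three column names and their dtypes are taken from the dictionary above, and the value ranges are assumptions.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv (assumed value ranges, three of the real column names).
rng = np.random.default_rng(0)
n_rows = 10_000
sample = pd.DataFrame({
    "user_id": rng.integers(0, 1_000_000, n_rows),
    "content_id": rng.integers(0, 30_000, n_rows),
    "answered_correctly": rng.integers(0, 2, n_rows),
})
csv_text = sample.to_csv(index=False)

# Default parsing: every integer column is read as int64.
default_df = pd.read_csv(io.StringIO(csv_text))

# Explicit narrow dtypes, matching the entries in the dtypes dictionary.
dtypes = {"user_id": "int32", "content_id": "int16", "answered_correctly": "int8"}
compact_df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

# Bytes used by the column data only (the index is excluded).
default_bytes = int(default_df.memory_usage(index=False).sum())
compact_bytes = int(compact_df.memory_usage(index=False).sum())
print("default dtypes:", default_bytes, "bytes")
print("compact dtypes:", compact_bytes, "bytes")
```

The narrow dtypes only work when every value fits the chosen type, so it is worth checking the column ranges before applying them to real data.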
%%time
dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean"
}
data = dd.read_csv("../input/riiid-test-answer-prediction/train.csv", dtype=dtypes).compute()
print("Train size:", data.shape)

Train size: (101230332, 10)
CPU times: user 9min 24s, sys: 28.8 s, total: 9min 52s
Wall time: 7min 41s

data.head()
# datatable installation with internet
# !pip install datatable==0.11.0 > /dev/null

# datatable installation without internet
!pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl > /dev/null

import datatable as dt

%%time
data = dt.fread("../input/riiid-test-answer-prediction/train.csv")
print("Train size:", data.shape)

Train size: (101230332, 10)
CPU times: user 52.5 s, sys: 18.4 s, total: 1min 10s
Wall time: 20.5 s

data.head()
# rapids installation (make sure to turn on GPU)
import sys
!cp ../input/rapids/rapids.0.15.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path

import cudf

%%time
data = cudf.read_csv("../input/riiid-test-answer-prediction/train.csv")
print("Train size:", data.shape)

Train size: (101230332, 10)
CPU times: user 4.58 s, sys: 3.31 s, total: 7.89 s
Wall time: 30.7 s

data.head()
# data = dt.fread("../input/riiid-test-answer-prediction/train.csv").to_pandas()

# writing dataset as csv
# data.to_csv("riiid_train.csv", index=False)

# writing dataset as hdf5
# data.to_hdf("riiid_train.h5", "riiid_train")

# writing dataset as feather
# data.to_feather("riiid_train.feather")

# writing dataset as parquet
# data.to_parquet("riiid_train.parquet")

# writing dataset as pickle
# data.to_pickle("riiid_train.pkl.gzip")

# writing dataset as jay
# dt.Frame(data).to_jay("riiid_train.jay")
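One practical difference between these formats is whether column dtypes survive a round trip. The sketch below writes a tiny, made-up frame to csv and to pickle in a temporary directory and reads both back; the file names and values are hypothetical, and only csv and pickle are used because they need no extra libraries.

```python
import os
import tempfile

import pandas as pd

# Tiny hypothetical frame that reuses two of the competition's column names.
data = pd.DataFrame({
    "content_id": pd.Series([10, 20, 30], dtype="int16"),
    "prior_question_elapsed_time": pd.Series([1.5, 2.5, 3.5], dtype="float32"),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    pkl_path = os.path.join(tmp_dir, "sample.pkl")

    # csv stores plain text, pickle stores the DataFrame object itself.
    data.to_csv(csv_path, index=False)
    data.to_pickle(pkl_path)

    # Read both files back while the temporary directory still exists.
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

print("csv dtypes:   ", dict(from_csv.dtypes.astype(str)))
print("pickle dtypes:", dict(from_pkl.dtypes.astype(str)))
```

In this sketch the csv copy comes back with pandas' default dtypes, while the pickle copy keeps the original ones, so no dtype mapping has to be passed when reading it.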
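The dataset is available in all of these formats at the link below, except for the original csv data provided by the competition organizers.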
https://www.kaggle.com/rohanrao/riiid-train-data-multiple-formats
%%time
dtypes = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "boolean",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean"
}
data = pd.read_csv("../input/riiid-test-answer-prediction/train.csv", dtype=dtypes)
print("Train size:", data.shape)

Train size: (101230332, 10)
CPU times: user 8min 36s, sys: 11.3 s, total: 8min 48s
Wall time: 8min 49s
%%time
data = pd.read_feather("../input/riiid-train-data-multiple-formats/riiid_train.feather")
print("Train size:", data.shape)

Train size: (101230332, 10)
CPU times: user 2.59 s, sys: 8.91 s, total: 11.5 s
Wall time: 5.19 s
%%time
data = pd.read_hdf("../input/riiid-train-data-multiple-formats/riiid_train.h5", "riiid_train")
print("Train size:", data.shape)

Train size: (101230332, 10)
CPU times: user 8.16 s, sys: 10.7 s, total: 18.9 s
Wall time: 19.8 s
%%time
data = dt.fread("../input/riiid-train-data-multiple-formats/riiid_train.jay")
print("Train size:", data.shape)

Train size: (101230332, 10)
CPU times: user 4.88 ms, sys: 7.35 ms, total: 12.2 ms
Wall time: 38 ms
%%time
data = pd.read_parquet("../input/riiid-train-data-multiple-formats/riiid_train.parquet")
print("Train size:", data.shape)

Train size: (101230332, 10)
CPU times: user 29.9 s, sys: 20.5 s, total: 50.4 s
Wall time: 27.3 s
%%time
data = pd.read_pickle("../input/riiid-train-data-multiple-formats/riiid_train.pkl.gzip")
print("Train size:", data.shape)

Train size: (101230332, 10)
CPU times: user 5.65 s, sys: 7.08 s, total: 12.7 s
Wall time: 15 s
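The %%time magic used in these cells only exists inside IPython or Jupyter. For a plain Python script, a comparable measurement could be sketched with time.perf_counter as below. The helper name timed_read, the file names, and the synthetic data are all assumptions made for illustration, and timings on such a small frame say nothing about the 101-million-row dataset.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def timed_read(reader, path):
    """Call reader(path) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    frame = reader(path)
    return frame, time.perf_counter() - start


# Synthetic data; the real train.csv is far larger.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "row_id": np.arange(50_000, dtype="int64"),
    "user_answer": rng.integers(0, 4, 50_000).astype("int8"),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    pkl_path = os.path.join(tmp_dir, "sample.pkl")
    sample.to_csv(csv_path, index=False)
    sample.to_pickle(pkl_path)

    # Time each reader on its own file.
    csv_frame, csv_seconds = timed_read(pd.read_csv, csv_path)
    pkl_frame, pkl_seconds = timed_read(pd.read_pickle, pkl_path)

print(f"csv:    {csv_frame.shape} in {csv_seconds:.4f} s")
print(f"pickle: {pkl_frame.shape} in {pkl_seconds:.4f} s")
```

Because timed_read accepts any callable that takes a path, the same helper could wrap the other readers shown above if the corresponding libraries are installed.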
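Pandas needs more RAM as the data grows
Dask can be slow, especially when the work cannot be parallelized
Datatable does not offer a rich set of data processing functions
Rapids only works on a GPU
With open source packages that keep improving and active community support, data science will continue to thrive.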