pickle serializes and deserializes Python objects; this is why converting a pandas DataFrame to a Spark DataFrame is slow: most of the time goes into data serialization. Enabling Arrow avoids the pickle round trip: spark.conf.set("spark.sql.execution.arrow.enabled", "true"). Install PyArrow with conda install -c conda-forge pyarrow or pip install pyarrow.
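As a rough end-to-end sketch of this conversion (the SparkSession, app name, and sample data below are assumptions added for illustration; in Spark 3.x the preferred config key is spark.sql.execution.arrow.pyspark.enabled):

# Minimal sketch: pandas -> Spark with Arrow enabled (assumes pyspark and pyarrow are installed)
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow_demo").getOrCreate()

# Spark 2.x key as used above; Spark 3.x prefers spark.sql.execution.arrow.pyspark.enabled
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = pd.DataFrame({"id": range(100000), "value": range(100000)})

# With Arrow enabled, the columns are transferred in bulk instead of being pickled row by row
sdf = spark.createDataFrame(pdf)

# toPandas() benefits from Arrow in the same way on the way back
pdf_back = sdf.toPandas()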
PyArrow can read the DataFrame in chunks, with each chunk fed to a dataloader. pa.Table.from_pandas(all_predict) converts the pandas DataFrame all_predict into a pyarrow.lib.Table, and pyarrow.parquet.write_table saves a pyarrow.lib.Table to a Parquet file.

import pyarrow as pa
import pyarrow.parquet as pq

df_test_all6 = pq.ParquetFile(df_test_all6_parquet)

# Read the data batch by batch
i = 0
out_path = "data/pq_predict_ans.parquet"
for batch in df_test_all6.iter_batches():
    batch_df = batch.to_pandas()
    print("batch_df test:\n", batch_df)
    beat_dense_features, beat_sparse_features, beat_train_dataloader, \
        beat_val_dataloader, beat_test_dataloader = pq_dataloader_ans(batch_df)
    # model has already been switched to model.eval()
    batch_y_pred = predict_model(model, beat_test_dataloader)
    batch_df['predict_prob'] = batch_y_pred
    if i == 0:
        all_predict = pa.Table.from_pandas(batch_df)
    else:
        all_predict = pa.concat_tables([all_predict, pa.Table.from_pandas(batch_df)])
    i = i + 1

# Save the prediction result
pq.write_table(all_predict, out_path)
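The helpers pq_dataloader_ans and predict_model above come from the surrounding project and are not shown here. A self-contained sketch of the same chunked-read / predict / concat_tables / write_table pattern, with those helpers replaced by a trivial stand-in transform and made-up file names, could look like this:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small demo Parquet file so the sketch runs on its own (made-up data)
pq.write_table(pa.Table.from_pandas(pd.DataFrame({"x": range(10000)})), "demo_input.parquet")

pf = pq.ParquetFile("demo_input.parquet")
tables = []
for batch in pf.iter_batches(batch_size=2048):  # read 2048 rows per batch
    batch_df = batch.to_pandas()
    batch_df["predict_prob"] = batch_df["x"] * 0.5  # stand-in for the real model prediction
    tables.append(pa.Table.from_pandas(batch_df))

# Concatenate the per-batch tables in memory and write once at the end
pq.write_table(pa.concat_tables(tables), "demo_output.parquet")

Accumulating per-batch tables and writing once avoids re-opening the output file for every batch; if the combined result is too large for memory, pyarrow.parquet.ParquetWriter can stream batches to disk instead.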