
Storing Hive Tables in Parquet Format

Hive 0.13 and later support STORED AS PARQUET natively (earlier versions require specifying the Parquet SerDe and input/output formats by hand).

Create a Hive table stored in Parquet format:

    CREATE TABLE parquet_test (
      id int,
      str string,
      mp MAP<STRING,STRING>,
      lst ARRAY<STRING>,
      strct STRUCT<A:STRING,B:STRING>)
    PARTITIONED BY (part string)
    STORED AS PARQUET;
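For reference, the same layout can be produced outside Hive with pyarrow. The sketch below is an addition here, not from the original: it assumes a reasonably recent pyarrow (pa.map_ is not available in very old releases), and whether a given Hive version reads MAP/STRUCT columns written externally depends on its Parquet compatibility. The partition column part is omitted because partition values live in the directory path, not in the data file.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # pyarrow schema mirroring the Hive DDL above
    schema = pa.schema([
        ('id', pa.int32()),                          # Hive int
        ('str', pa.string()),                        # Hive string
        ('mp', pa.map_(pa.string(), pa.string())),   # Hive MAP<STRING,STRING>
        ('lst', pa.list_(pa.string())),              # Hive ARRAY<STRING>
        ('strct', pa.struct([('A', pa.string()),
                             ('B', pa.string())])),  # Hive STRUCT<A:STRING,B:STRING>
    ])

    table = pa.table({
        'id': [1],
        'str': ['hello'],
        'mp': [[('k1', 'v1')]],          # map cells are lists of (key, value) tuples
        'lst': [['a', 'b']],
        'strct': [{'A': 'x', 'B': 'y'}],
    }, schema=schema)
    pq.write_table(table, 'parquet_test.parquet')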

Test:

Generate a Parquet file locally:

    >>> import numpy as np
    >>> import pandas as pd
    >>> import pyarrow as pa
    >>> df = pd.DataFrame({'one': ['test', 'lisi', 'wangwu'], 'two': ['foo', 'bar', 'baz']})
    >>> table = pa.Table.from_pandas(df)
    >>> import pyarrow.parquet as pq
    >>> pq.write_table(table, 'example.parquet2')
    # Specifying a compression codec (snappy is the default):
    # >>> pq.write_table(table, 'example.parquet2', compression='snappy')
    # >>> pq.write_table(table, 'example.parquet2', compression='gzip')
    # >>> pq.write_table(table, 'example.parquet2', compression='brotli')
    # >>> pq.write_table(table, 'example.parquet2', compression='none')
    >>> table2 = pq.read_table('example.parquet2')
    >>> table2.to_pandas()
          one  two
    0    test  foo
    1    lisi  bar
    2  wangwu  baz

Snappy compresses and decompresses faster, while gzip achieves a better compression ratio.
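A quick way to see that trade-off locally is to write the same table with each codec and compare file sizes; the sketch below is an addition here, with illustrative file names:

    import os
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # A somewhat larger table so the size differences are visible
    df = pd.DataFrame({'one': ['test', 'lisi', 'wangwu'] * 10000,
                       'two': ['foo', 'bar', 'baz'] * 10000})
    table = pa.Table.from_pandas(df)

    # Write once per codec and print the resulting size on disk
    for codec in ('snappy', 'gzip', 'brotli', 'none'):
        path = 'example_%s.parquet' % codec
        pq.write_table(table, path, compression=codec)
        print(codec, os.path.getsize(path))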

Create a Hive table and load the generated Parquet data:

    hive> create table parquet_example(one string, two string) STORED AS PARQUET;
    hive> load data local inpath './example.parquet2' overwrite into table parquet_example;
    hive> select * from parquet_example;
    OK
    test    foo
    lisi    bar
    wangwu  baz
    Time taken: 0.071 seconds, Fetched: 3 row(s)
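Note that load data only moves the file into the table's directory as-is, so the Parquet file's column names must be compatible with the table definition (by default Hive resolves Parquet columns by name). One way to sanity-check a file before loading it, sketched here with pyarrow as an addition to the original:

    import pyarrow.parquet as pq

    # Inspect the schema and row-group layout of the file Hive will ingest
    print(pq.read_schema('example.parquet2'))
    meta = pq.read_metadata('example.parquet2')
    print(meta.num_rows, 'rows in', meta.num_row_groups, 'row group(s)')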

Hive Parquet Configuration

Hive supports several Parquet-related settings, mainly:

    parquet.compression
    parquet.block.size
    parquet.page.size

These can be set directly in the Hive session:

hive> set parquet.compression=snappy;

Parameters that control the block size when Hive writes Parquet (a pyarrow sketch follows the list):

    parquet.block.size
    dfs.blocksize
    mapred.max.split.size
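These settings apply when Hive itself writes the files; a common rule of thumb (an assumption added here, not from the original) is to keep parquet.block.size no larger than dfs.blocksize so that a row group never straddles two HDFS blocks. When generating files with pyarrow instead, the rough counterparts are arguments to pq.write_table:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({'one': ['test', 'lisi', 'wangwu'],
                      'two': ['foo', 'bar', 'baz']})

    pq.write_table(
        table,
        'example_tuned.parquet',
        compression='snappy',        # ~ parquet.compression
        row_group_size=100000,       # ~ parquet.block.size, but counted in rows, not bytes
        data_page_size=1024 * 1024,  # ~ parquet.page.size, in bytes
    )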

 
