Hive 0.13 and later
Create a Hive table stored as Parquet:
- CREATE TABLE parquet_test (
- id int,
- str string,
- mp MAP<STRING,STRING>,
- lst ARRAY<STRING>,
- strct STRUCT<A:STRING,B:STRING>)
- PARTITIONED BY (part string)
- STORED AS PARQUET;
Test:
Generate a Parquet file locally
- >>> import numpy as np
- >>> import pandas as pd
- >>> import pyarrow as pa
- >>> df = pd.DataFrame({'one':['test','lisi','wangwu'], 'two': ['foo', 'bar', 'baz']})
- >>> table = pa.Table.from_pandas(df)
- >>> import pyarrow.parquet as pq
- >>> pq.write_table(table, 'example.parquet2')
- # Specify the compression codec
- # snappy is the default:
- # >>> pq.write_table(table, 'example.parquet2', compression='snappy')
- # >>> pq.write_table(table, 'example.parquet2', compression='gzip')
- # >>> pq.write_table(table, 'example.parquet2', compression='brotli')
- # >>> pq.write_table(table, 'example.parquet2', compression='none')
- >>> table2 = pq.read_table('example.parquet2')
- >>> table2.to_pandas()
- one two
- 0 test foo
- 1 lisi bar
- 2 wangwu baz

Snappy compresses and decompresses faster, while Gzip achieves a better compression ratio.
Create a Hive table and load the generated Parquet data
- hive> create table parquet_example(one string, two string) STORED AS PARQUET;
- hive> load data local inpath './example.parquet2' overwrite into table parquet_example;
- hive> select * from parquet_example;
- OK
- test foo
- lisi bar
- wangwu baz
- Time taken: 0.071 seconds, Fetched: 3 row(s)
Hive Parquet configuration
Hive supports several Parquet settings, mainly:
- parquet.compression
- parquet.block.size
- parquet.page.size
These can be set directly in a Hive session:
- hive> set parquet.compression=snappy;
Parameters that affect block size in Hive:
- parquet.block.size
- dfs.blocksize
- mapred.max.split.size
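The size parameters above are given in bytes, which is easy to get wrong when typing them by hand. As a small sketch, the `set` statements can be generated from a mapping. The 128 MB figure is purely illustrative and not a recommendation from this article, and `render_set_statements` is a hypothetical helper.

```python
MB = 1024 * 1024

# Illustrative values only; choose sizes that suit your cluster.
settings = {
    "parquet.compression": "snappy",
    "parquet.block.size": 128 * MB,
    "dfs.blocksize": 128 * MB,
    "mapred.max.split.size": 128 * MB,
}


def render_set_statements(settings):
    """Turn a mapping of property names to values into Hive `set` statements."""
    return [f"set {key}={value};" for key, value in settings.items()]


statements = render_set_statements(settings)
print("\n".join(statements))
```

The printed lines can be pasted into a Hive session or placed at the top of a script before the statements that write the table.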
References:
Reading and writing Parquet in Python: Reading and Writing the Apache Parquet Format;
Parquet support in Hive: Parquet;