CSV contents:
$ cat test.txt
1|2|3|test
2|4|6|wwww
Using PySpark:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

if __name__ == "__main__":
    sc = SparkContext(appName="CSV2Parquet")
    sqlContext = SQLContext(sc)

    # All four columns are read as strings.
    schema = StructType([
        StructField("id", StringType(), True),
        StructField("num1", StringType(), True),
        StructField("num2", StringType(), True),
        StructField("string", StringType(), True),
    ])

    # Split each pipe-delimited line into fields, then write out as Parquet.
    rdd = sc.textFile("/var/tmp/test.txt").map(lambda line: line.split("|"))
    df = sqlContext.createDataFrame(rdd, schema)
    df.write.parquet('/var/tmp/test.parq')
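The parsing step in the `map` above can be sketched without a Spark environment; a minimal standalone check (the sample rows below mirror test.txt):

```python
# Parse pipe-delimited lines the same way the Spark job's map() does,
# producing one list of fields (id, num1, num2, string) per line.
lines = ["1|2|3|test", "2|4|6|wwww"]

rows = [line.split("|") for line in lines]

for row in rows:
    print(row)
# → ['1', '2', '3', 'test']
#   ['2', '4', '6', 'wwww']
```

Note that `split("|")` yields strings only, which is why every field in the schema is declared `StringType`.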
CDH provides the parquet-tools command for inspecting Parquet files:
parquet-tools cat sample.parq
parquet-tools head -n 2 sample.parq
parquet-tools schema sample.parq
parquet-tools meta sample.parq
parquet-tools dump sample.parq