@Author : Spinach | GHB
@Link : http://blog.csdn.net/bocai8058
When data is written with dynamic partitioning enabled (`set hive.exec.dynamic.partition=true`), a large number of small files can be produced. In HDFS, a file is stored as blocks on the DataNodes, while block locations and other metadata are kept on the NameNode; in production the block size is usually set to 256MB. For more on block sizing, see 《HDFS Block块大小限定依据及原则》.
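To see how the problem arises, here is a minimal Scala sketch (the `person` source table and `tableName` target are hypothetical, mirroring the examples below): in a dynamic-partition write, every task writes one file per partition value it holds, so the file count grows with tasks × partition values.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Minimal sketch, assuming a Hive table `person` with a `date` column
// and a target table `tableName` partitioned by `date`.
object SmallFileDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("small-file-demo")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("set hive.exec.dynamic.partition=true")
    spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

    // Each task writes one file per `date` value it receives, so with
    // many tasks and many dates the target fills up with small files.
    spark.table("person")
      .write
      .mode(SaveMode.Overwrite)
      .insertInto("tableName")
  }
}
```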
## Parameters for merging small files on the map input side

```sql
set mapred.max.split.size=256000000;           -- maximum split size: 256MB
set mapred.min.split.size.per.node=100000000;  -- minimum split size per node: 100MB
set mapred.min.split.size.per.rack=100000000;  -- minimum split size per rack: 100MB
-- combine multiple small files into a single map split
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```
## Parameters for merging map output and reduce output files

```sql
set hive.merge.mapfiles=true;                 -- merge small files produced by map-only jobs
set hive.merge.mapredfiles=true;              -- merge small files produced by map-reduce jobs
set hive.merge.size.per.task=256000000;       -- target size of merged files: 256MB
set hive.merge.smallfiles.avgsize=256000000;  -- trigger a merge job when the average output file is below 256MB
```
For example, a dynamic-partition rewrite that distributes rows randomly across reducers, so each reducer handles a similar amount of data and writes similarly sized files:

```sql
insert overwrite table tableName partition(date) select * from person DISTRIBUTE BY (rand()*3);
```
For ORC tables, the small files in a partition can also be merged in place with Hive's CONCATENATE command:

```sql
alter table tableName_orc partition (date="20200101") concatenate;
```
On the Spark side there are several options: enable adaptive execution when building the SparkSession (`.config("spark.sql.adaptive.enabled", "true")`), which lets Spark coalesce small shuffle partitions automatically; call `repartition` or `coalesce` on a DataFrame; or use the `/*+ REPARTITION(n) */` hint in SQL, for example:

```sql
create table tableName as select /*+ REPARTITION(10) */ age,name from person where date='20200101'

insert into table tableName select /*+ REPARTITION(10) */ age,name from person where date='20200101'

insert overwrite table tableName select /*+ REPARTITION(10) */ age,name from person where date='20200101'
```

Note: choose the REPARTITION count according to the volume of data being written.
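The same controls are available on the DataFrame side. A minimal Scala sketch, assuming the hypothetical `person`/`tableName` tables from the SQL examples above: adaptive execution is enabled on the session, and `repartition(10)` caps this write at 10 output files.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("merge-small-files")
  // let AQE coalesce small shuffle partitions automatically
  .config("spark.sql.adaptive.enabled", "true")
  .enableHiveSupport()
  .getOrCreate()

spark.table("person")
  .where("date = '20200101'")
  .select("age", "name")
  .repartition(10) // at most 10 output files for this write
  .write
  .mode(SaveMode.Overwrite)
  .insertInto("tableName")
```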
`distribute by rand()` assigns rows to reducers randomly, so each reducer processes roughly the same amount of data, for example:

```sql
insert overwrite table tableName partition(date) select * from person DISTRIBUTE BY rand()
```
In Spark application code, a partition can be compacted by reading it back, deduplicating, coalescing to a small number of partitions, and overwriting it:

```scala
import org.apache.spark.sql.SaveMode

spark.table(dwdTableName)
  .where(s"""statis_date='$statis_date'""")
  .coalesce(5) // at most 5 output files
  .distinct()
  .write.mode(SaveMode.Overwrite)
  .insertInto(dwdTableName)
```