Hints give users a way to suggest how Spark SQL should generate its execution plan by asking it to use specific approaches.
/*+ hint [ , ... ] */
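As a rough illustration of the syntax above, the following Python sketch renders a list of hints as a hint comment and splices it into a query string. The helper name `format_hints` and the `(name, args)` tuple representation are assumptions made for this example; they are not part of Spark.

```python
def format_hints(hints):
    """Render (name, args) pairs as a Spark SQL hint comment.

    `hints` is a hypothetical representation chosen for this sketch:
    a list of (hint_name, list_of_arguments) tuples.
    """
    parts = []
    for name, args in hints:
        if args:
            # Hints with parameters are written as NAME(arg1, arg2, ...)
            parts.append(f"{name}({', '.join(str(a) for a in args)})")
        else:
            # Hints without parameters are written as a bare name
            parts.append(name)
    return "/*+ " + ", ".join(parts) + " */"


# Build a query string that carries a single hint.
query = f"SELECT {format_hints([('REPARTITION', [3, 'c'])])} * FROM t"
print(query)
```

The resulting string can be passed to any client that submits SQL text to Spark; the hint comment must follow the SELECT keyword directly.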
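Partitioning hints allow users to suggest a partitioning strategy that Spark should follow. The COALESCE, REPARTITION, and REPARTITION_BY_RANGE hints are supported and are equivalent to the coalesce, repartition, and repartitionByRange Dataset APIs, respectively. REBALANCE can only be used as a hint. These hints give users a way to tune performance and control the number of output files in Spark SQL. When multiple partitioning hints are specified, multiple nodes are inserted into the logical plan, but the optimizer picks the leftmost hint.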
SELECT /*+ COALESCE(3) */ * FROM t;
SELECT /*+ REPARTITION(3) */ * FROM t;
SELECT /*+ REPARTITION(c) */ * FROM t;
SELECT /*+ REPARTITION(3, c) */ * FROM t;
SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t;
SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t;
SELECT /*+ REBALANCE */ * FROM t;
SELECT /*+ REBALANCE(3) */ * FROM t;
SELECT /*+ REBALANCE(c) */ * FROM t;
SELECT /*+ REBALANCE(3, c) */ * FROM t;

-- multiple partitioning hints
EXPLAIN EXTENDED SELECT /*+ REPARTITION(100), COALESCE(500), REPARTITION_BY_RANGE(3, c) */ * FROM t;

== Parsed Logical Plan ==
'UnresolvedHint REPARTITION, [100]
+- 'UnresolvedHint COALESCE, [500]
   +- 'UnresolvedHint REPARTITION_BY_RANGE, [3, 'c]
      +- 'Project [*]
         +- 'UnresolvedRelation [t]

== Analyzed Logical Plan ==
name: string, c: int
Repartition 100, true
+- Repartition 500, false
   +- RepartitionByExpression [c#30 ASC NULLS FIRST], 3
      +- Project [name#29, c#30]
         +- SubqueryAlias spark_catalog.default.t
            +- Relation[name#29,c#30] parquet

== Optimized Logical Plan ==
Repartition 100, true
+- Relation[name#29,c#30] parquet

== Physical Plan ==
Exchange RoundRobinPartitioning(100), false, [id=#121]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t[name#29,c#30] Batched: true, DataFilters: [], Format: Parquet, Location: CatalogFileIndex[file:/spark/spark-warehouse/t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:string>
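The "leftmost hint wins" rule can be modelled with a few lines of Python. This is only a toy model of the documented behavior, not Spark's optimizer code; the function name `effective_partitioning_hint` and the `(name, args)` tuples are assumptions made for this sketch.

```python
# The four partitioning hints named in the text above.
PARTITIONING_HINTS = {"COALESCE", "REPARTITION", "REPARTITION_BY_RANGE", "REBALANCE"}


def effective_partitioning_hint(hints):
    """Return the leftmost partitioning hint, or None if there is none.

    `hints` is a list of (hint_name, args) tuples in the order in which
    they appear inside the /*+ ... */ comment (hypothetical representation).
    """
    for name, args in hints:
        if name.upper() in PARTITIONING_HINTS:
            return name.upper(), args
    return None


# Same hint list as the EXPLAIN EXTENDED example above.
hints = [
    ("REPARTITION", [100]),
    ("COALESCE", [500]),
    ("REPARTITION_BY_RANGE", [3, "c"]),
]
winner = effective_partitioning_hint(hints)
print(winner)
```

For the hint list from the EXPLAIN EXTENDED example, the model selects REPARTITION with 100 partitions, which agrees with the `Repartition 100, true` node that remains in the optimized logical plan.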
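Join hints allow users to suggest the join strategy that Spark should use. Prior to Spark 3.0, only the BROADCAST join hint was supported. Support for the MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL join hints was added in 3.0. When different join strategy hints are specified on both sides of a join, Spark prioritizes them in the following order: BROADCAST > MERGE > SHUFFLE_HASH > SHUFFLE_REPLICATE_NL. When both sides carry the BROADCAST hint or the SHUFFLE_HASH hint, Spark picks the build side based on the join type and the sizes of the relations. Because a given strategy may not support every join type, Spark is not guaranteed to use the join strategy suggested by the hint.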
-- Join Hints for broadcast join
SELECT /*+ BROADCAST(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
SELECT /*+ BROADCASTJOIN (t1) */ * FROM t1 left JOIN t2 ON t1.key = t2.key;
SELECT /*+ MAPJOIN(t2) */ * FROM t1 right JOIN t2 ON t1.key = t2.key;

-- Join Hints for shuffle sort merge join
SELECT /*+ SHUFFLE_MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
SELECT /*+ MERGEJOIN(t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
SELECT /*+ MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

-- Join Hints for shuffle hash join
SELECT /*+ SHUFFLE_HASH(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

-- Join Hints for shuffle-and-replicate nested loop join
SELECT /*+ SHUFFLE_REPLICATE_NL(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

-- When different join strategy hints are specified on both sides of a join, Spark
-- prioritizes the BROADCAST hint over the MERGE hint over the SHUFFLE_HASH hint
-- over the SHUFFLE_REPLICATE_NL hint.
-- Spark will issue Warning in the following example
-- org.apache.spark.sql.catalyst.analysis.HintErrorLogger: Hint (strategy=merge)
-- is overridden by another hint and will not take effect.
SELECT /*+ BROADCAST(t1), MERGE(t1, t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
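The priority order among join hints can be sketched in plain Python. This is a simplified model of the stated rule only: it ignores join types, relation sizes, and build-side selection, and it is not Spark's actual planner logic. The names `pick_join_strategy` and `normalize` are hypothetical; the alias table is taken from the groupings in the examples above.

```python
# Priority order as stated in the text: earlier entries win.
JOIN_HINT_PRIORITY = ["BROADCAST", "MERGE", "SHUFFLE_HASH", "SHUFFLE_REPLICATE_NL"]

# Alternative spellings that appear in the examples above.
ALIASES = {
    "BROADCASTJOIN": "BROADCAST",
    "MAPJOIN": "BROADCAST",
    "SHUFFLE_MERGE": "MERGE",
    "MERGEJOIN": "MERGE",
}


def normalize(name):
    """Map a hint name (any case, any alias) to its canonical strategy name."""
    upper = name.upper()
    return ALIASES.get(upper, upper)


def pick_join_strategy(left_hint, right_hint):
    """Return the highest-priority strategy hinted on either side, or None.

    Each argument is a hint name, or None when that side carries no hint.
    """
    candidates = [normalize(h) for h in (left_hint, right_hint) if h is not None]
    for strategy in JOIN_HINT_PRIORITY:
        if strategy in candidates:
            return strategy
    return None


# Mirrors the last example above: BROADCAST overrides MERGE.
chosen = pick_join_strategy("BROADCAST", "MERGE")
print(chosen)
```

In this model the losing hint is silently dropped, whereas Spark logs the HintErrorLogger warning shown in the example and may still fall back to another strategy if the hinted one does not support the join type.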