
【Spark】【Spark Configuration】【Spark配置】


目录

Spark Properties

Dynamically Loading Spark Properties 动态加载Spark属性

Viewing Spark Properties 查看Spark属性

Available Properties 可用属性

Application Properties 应用程序属性

Runtime Environment 运行时环境 

Shuffle Behavior Shuffle行为

Spark UI

Compression and Serialization 压缩和序列化

Memory Management 内存管理

Execution Behavior 执行行为

Executor Metrics Executor指标

Networking 网络

Scheduling 调度 

Barrier Execution Mode 屏障执行模式 

Dynamic Allocation 动态分配 

Thread Configurations 线程配置

Spark Connect

Server Configuration 服务器配置 

Security 安全 

Spark SQL

Runtime SQL Configuration 运行时SQL配置

Static SQL Configuration 静态SQL配置

Spark Streaming

SparkR

GraphX

Deploy 部署 

Cluster Managers 集群管理器

YARN  

Mesos

Kubernetes

Standalone Mode  独立模式

Environment Variables 环境变量 

Configuring Logging 配置日志记录 

Overriding configuration directory 覆盖配置目录

Inheriting Hadoop Cluster Configuration 继承Hadoop集群配置

Custom Hadoop/Hive Configuration 自定义Hadoop/Hive配置

Custom Resource Scheduling and Configuration Overview 自定义资源调度和配置概述

Stage Level Scheduling Overview 阶段级调度概述

Push-based shuffle overview 基于推送的Shuffle概述

External Shuffle service(server) side configuration options 外部Shuffle服务(服务器)端配置选项

Client side configuration options 客户端配置选项


Spark provides three locations to configure the system:
Spark提供了三个位置来配置系统:

  • Spark properties control most application parameters and can be set by using a SparkConf object, or through Java system properties.
    Spark属性控制大多数应用程序参数,可以使用SparkConf对象或通过Java系统属性进行设置。
  • Environment variables can be used to set per-machine settings, such as the IP address, through the conf/spark-env.sh script on each node.
    环境变量可用于通过每个节点上的 conf/spark-env.sh 脚本设置每台计算机的设置,例如IP地址。
  • Logging can be configured through log4j2.properties.
    日志可以通过 log4j2.properties 配置。

Spark Properties

Spark properties control most application settings and are configured separately for each application. These properties can be set directly on a SparkConf passed to your SparkContext. SparkConf allows you to configure some of the common properties (e.g. master URL and application name), as well as arbitrary key-value pairs through the set() method. For example, we could initialize an application with two threads as follows:
Spark属性控制大多数应用程序设置,并为每个应用程序单独配置。这些属性可以直接在传递给你的 SparkContext 的SparkConf上设置。 SparkConf 允许您配置一些常见属性(例如主URL和应用程序名称),以及通过 set() 方法配置任意键值对。例如,我们可以用两个线程初始化一个应用程序,如下所示:

Note that we run with local[2], meaning two threads, which represents "minimal" parallelism; this can help detect bugs that only exist when we run in a distributed context.
请注意,我们使用 local[2] 运行,即两个线程,这代表"最小"的并行度;这有助于发现只有在分布式环境中运行时才会出现的错误。

  val conf = new SparkConf()
               .setMaster("local[2]")
               .setAppName("CountingSheep")
  val sc = new SparkContext(conf)

Note that we can have more than 1 thread in local mode, and in cases like Spark Streaming, we may actually require more than 1 thread to prevent any sort of starvation issues.
请注意,我们可以在本地模式下有多个线程,在Spark Streaming这样的情况下,我们实际上可能需要多个线程来防止任何类型的饥饿问题。
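As an illustration, here is a minimal Spark Streaming sketch that needs at least two local threads (one for the receiver, one for the processing tasks). The socket source on localhost:9999 and the 1-second batch interval are purely illustrative assumptions, and the spark-streaming dependency is assumed to be on the classpath.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // local[2]: one thread runs the socket receiver, the other runs the processing tasks.
  val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
  val ssc = new StreamingContext(conf, Seconds(1))

  // Hypothetical text source (e.g. started with `nc -lk 9999`).
  val lines = ssc.socketTextStream("localhost", 9999)
  lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()

  ssc.start()
  ssc.awaitTermination()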

Properties that specify some time duration should be configured with a unit of time. The following format is accepted:
指定某个持续时间的属性应配置为时间单位。接受以下格式:

  25ms (milliseconds)
  5s (seconds)
  10m or 10min (minutes)
  3h (hours)
  5d (days)
  1y (years)

Properties that specify a byte size should be configured with a unit of size. The following format is accepted:
指定字节大小的属性应配置为使用大小单位。接受以下格式:

  1b (bytes)
  1k or 1kb (kibibytes = 1024 bytes)
  1m or 1mb (mebibytes = 1024 kibibytes)
  1g or 1gb (gibibytes = 1024 mebibytes)
  1t or 1tb (tebibytes = 1024 gibibytes)
  1p or 1pb (pebibytes = 1024 tebibytes)

While numbers without units are generally interpreted as bytes, a few are interpreted as KiB or MiB. See documentation of individual configuration properties. Specifying units is desirable where possible.
虽然没有单位的数字通常被解释为字节,但少数会被解释为KiB或MiB。请参阅各个配置属性的文档。在可能的情况下,最好明确指定单位。
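As a small sketch of the unit syntax above (the property names are real Spark properties, but the specific values are arbitrary), duration- and size-valued settings can be given explicit units on a SparkConf:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("UnitsExample")
    // Duration-valued property: 120 seconds.
    .set("spark.network.timeout", "120s")
    // Size-valued properties: explicit units avoid any ambiguity between bytes, KiB and MiB.
    .set("spark.driver.maxResultSize", "2g")
    .set("spark.reducer.maxSizeInFlight", "96m")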

Dynamically Loading Spark Properties
动态加载Spark属性

In some cases, you may want to avoid hard-coding certain configurations in a SparkConf. For instance, if you’d like to run the same application with different masters or different amounts of memory. Spark allows you to simply create an empty conf:
在某些情况下,您可能希望避免在 SparkConf 中硬编码某些配置。例如,如果你想用不同的主机或不同的内存量运行同一个应用程序。Spark允许你简单地创建一个空的conf:

val sc = new SparkContext(new SparkConf())

Then, you can supply configuration values at runtime:
然后,您可以在运行时提供配置值:

  ./bin/spark-submit --name "My app" --master local[4] --conf spark.eventLog.enabled=false
    --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar

The Spark shell and spark-submit tool support two ways to load configurations dynamically. The first is command line options, such as --master, as shown above. spark-submit can accept any Spark property using the --conf/-c flag, but uses special flags for properties that play a part in launching the Spark application. Running ./bin/spark-submit --help will show the entire list of these options.
Spark shell和 spark-submit 工具支持两种动态加载配置的方式。第一种是命令行选项,例如上面所示的 --master 。 spark-submit 可以通过 --conf/-c 标志接受任何Spark属性,但对于在启动Spark应用程序时起作用的属性会使用专门的标志。运行 ./bin/spark-submit --help 将显示这些选项的完整列表。

bin/spark-submit will also read configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace. For example:
bin/spark-submit 还将从 conf/spark-defaults.conf 读取配置选项,其中每行由空格分隔的键和值组成。举例来说:

  spark.master            spark://5.6.7.8:7077
  spark.executor.memory   4g
  spark.eventLog.enabled  true
  spark.serializer        org.apache.spark.serializer.KryoSerializer

Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
任何指定为标志或属性文件中的值都将传递给应用程序,并与通过SparkConf指定的值合并。直接在SparkConf上设置的属性具有最高优先级,然后是传递给 spark-submit 或 spark-shell 的标志,然后是 spark-defaults.conf 文件中的选项。自早期版本的Spark以来,一些配置键已经被重命名;在这种情况下,旧的键名称仍然被接受,但优先级低于新键的任何实例。

Spark properties can mainly be divided into two kinds: one is related to deploy, like "spark.driver.memory" and "spark.executor.instances"; this kind of property may not be affected when set programmatically through SparkConf at runtime, or its behavior depends on which cluster manager and deploy mode you choose, so it is suggested to set it through the configuration file or spark-submit command line options. The other kind is mainly related to Spark runtime control, like "spark.task.maxFailures"; this kind of property can be set in either way.
Spark属性主要分为两类:一类是与部署相关的,如“spark.driver.memory”、“spark.executor.instances”,这类属性在运行时通过 SparkConf 编程设置可能不会受到影响,或者行为取决于您选择的集群管理器和部署模式,建议通过配置文件或 spark-submit 命令行选项设置;另一种主要是与Spark运行时控制有关,如“spark.task.maxFailures”,这类属性可以通过两种方式设置。
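For example, a runtime-control property such as spark.task.maxFailures can safely be set programmatically, while deploy-related properties are better passed to spark-submit; the sketch below only illustrates the split, the values themselves are arbitrary.

  import org.apache.spark.{SparkConf, SparkContext}

  // Runtime-control property: safe to set through SparkConf.
  val conf = new SparkConf()
    .setAppName("RuntimeControlExample")
    .set("spark.task.maxFailures", "8")

  // Deploy-related properties such as spark.driver.memory or spark.executor.instances may be
  // ignored when set here (e.g. the driver JVM is already running in client mode), so pass
  // them on the command line instead, e.g.:
  //   ./bin/spark-submit --driver-memory 2g --conf spark.executor.instances=4 ...
  val sc = new SparkContext(conf)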

Viewing Spark Properties
查看Spark属性

The application web UI at http://<driver>:4040 lists Spark properties in the "Environment" tab. This is a useful place to check to make sure that your properties have been set correctly. Note that only values explicitly specified through spark-defaults.conf, SparkConf, or the command line will appear. For all other configuration properties, you can assume the default value is used.
在 http://<driver>:4040 的应用程序Web UI中,在“Environment”选项卡中列出了Spark属性。这是一个很有用的检查位置,可以确保您的属性设置正确。请注意,只有通过 spark-defaults.conf 、 SparkConf 或命令行显式指定的值才会显示。对于所有其他配置属性,可以假定使用默认值。
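Besides the web UI, the explicitly-set properties can also be inspected from the driver; a short sketch (only values that were actually set are returned, defaults are not materialized):

  // sc.getConf returns a copy of the effective SparkConf.
  sc.getConf.getAll.sortBy(_._1).foreach { case (key, value) => println(s"$key=$value") }

  // Look up a single property, with a fallback for properties that were never set.
  val appName = sc.getConf.get("spark.app.name")
  val maxFailures = sc.getConf.get("spark.task.maxFailures", "4")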

Available Properties 可用属性 

Most of the properties that control internal settings have reasonable default values. Some of the most common options to set are:
大多数控制内部设置的属性都有合理的默认值。一些最常见的设置选项是:

Application Properties 应用程序属性

Property Name 属性名称 Default 默认 Meaning 意义 Since Version 从版本
spark.app.name (none) (无)

The name of your application.

This will appear in the UI and in log data.
您的应用程序的名称。

这将出现在UI和日志数据中。

0.9.0
spark.driver.cores 1 Number of cores to use for the driver process, only in cluster mode.
用于驱动程序进程的内核数量,仅在群集模式下。
1.3.0
spark.driver.maxResultSize 1g Limit of total size of serialized results of all partitions for each Spark action (e.g. collect) in bytes. Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size is above this limit. Having a high limit may cause out-of-memory errors in driver (depends on spark.driver.memory and memory overhead of objects in JVM). Setting a proper limit can protect the driver from out-of-memory errors.
每个Spark操作(例如collect)的所有分区的序列化结果的总大小限制(以字节为单位)。应该至少为1M,或0表示无限制。如果总大小超过此限制,作业将被中止。设置高限制可能会导致驱动程序中出现内存不足错误(取决于spark.driver.memory和JVM中对象的内存开销)。设置适当的限制可以保护驱动程序免受内存不足错误的影响。
1.2.0
spark.driver.memory 1g Amount of memory to use for the driver process, i.e. where SparkContext is initialized, in the same format as JVM memory strings with a size unit suffix ("k", "m", "g" or "t") (e.g. 512m2g).
用于驱动程序进程的内存量,即初始化SparkContext的位置,格式与JVM内存字符串相同,带有大小单位后缀(“k”,“m”,“g”或“t”)(例如 512m , 2g )。
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option or in your default properties file.
注意事项:在客户端模式下,不能直接在应用程序中通过 SparkConf 设置此配置,因为驱动程序JVM已经在该点启动。相反,请通过 --driver-memory 命令行选项或在默认属性文件中进行设置。
1.1.1
spark.driver.memoryOverhead driverMemory * spark.driver.memoryOverheadFactor, with minimum of 384
driverMemory * spark.driver.memoryOverheadFactor ,最小384
Amount of non-heap memory to be allocated per driver process in cluster mode, in MiB unless otherwise specified. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the container size (typically 6-10%). This option is currently supported on YARN, Mesos and Kubernetes. Note: Non-heap memory includes off-heap memory (when spark.memory.offHeap.enabled=true) and memory used by other driver processes (e.g. python process that goes with a PySpark driver) and memory used by other non-driver processes running in the same container. The maximum memory size of container to running driver is determined by the sum of spark.driver.memoryOverhead and spark.driver.memory.
在群集模式下为每个驱动程序进程分配的非堆内存量,除非另有指定,否则以MiB为单位。这是一种内存,用于处理VM开销、内部字符串、其他本机开销等。这往往会随着容器大小的增加而增加(通常为6-10%)。此选项目前在YARN、Mesos和Kubernetes上受支持。注意事项:非堆内存包括堆外内存(当 spark.memory.offHeap.enabled=true 时)和其他驱动程序进程使用的内存(例如PySpark驱动程序附带的python进程)以及在同一容器中运行的其他非驱动程序进程使用的内存。容器到运行驱动程序的最大内存大小由 spark.driver.memoryOverhead 和 spark.driver.memory 之和决定。
2.3.0
spark.driver.memoryOverheadFactor 0.10 Fraction of driver memory to be allocated as additional non-heap memory per driver process in cluster mode. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the container size. This value defaults to 0.10 except for Kubernetes non-JVM jobs, which defaults to 0.40. This is done as non-JVM tasks need more non-JVM heap space and such tasks commonly fail with "Memory Overhead Exceeded" errors. This preempts this error with a higher default. This value is ignored if spark.driver.memoryOverhead is set directly.
在群集模式下,要为每个驱动程序进程分配作为附加非堆内存的驱动程序内存部分。这是内存,用于处理VM开销、内部字符串、其他本机开销等。这往往会随着容器大小的增加而增加。此值默认为0.10,但Kubernetes非JVM作业除外,默认值为0.40。这样做是因为非JVM任务需要更多的非JVM堆空间,并且此类任务通常会因“内存开销超出”错误而失败。这将使用更高的默认值来抢占此错误。如果直接设置 spark.driver.memoryOverhead ,则忽略此值。
3.3.0
spark.driver.resource.{resourceName}.amount 0 Amount of a particular resource type to use on the driver. If this is used, you must also specify the spark.driver.resource.{resourceName}.discoveryScript for the driver to find the resource on startup.
用于驱动因素的特定资源类型的数量。如果使用此选项,您还必须指定 spark.driver.resource.{resourceName}.discoveryScript ,以便驱动程序在启动时查找资源。
3.0.0
spark.driver.resource.{resourceName}.discoveryScript None 没有一 A script for the driver to run to discover a particular resource type. This should write to STDOUT a JSON string in the format of the ResourceInformation class. This has a name and an array of addresses. For a client-submitted driver, discovery script must assign different resource addresses to this driver comparing to other drivers on the same host.
驱动程序运行以发现特定资源类型的脚本。这应该以ResourceInformation类的格式向STDOUT写入JSON字符串。它有一个名称和一个地址数组。对于客户端提交的驱动程序,与同一主机上的其他驱动程序相比,发现脚本必须为此驱动程序分配不同的资源地址。
3.0.0
spark.driver.resource.{resourceName}.vendor None 没有一 Vendor of the resources to use for the driver. This option is currently only supported on Kubernetes and is actually both the vendor and domain following the Kubernetes device plugin naming convention. (e.g. For GPUs on Kubernetes this config would be set to nvidia.com or amd.com)
要用于驱动程序的资源的供应商。此选项目前仅在Kubernetes上受支持,实际上是遵循Kubernetes设备插件命名约定的供应商和域。(e.g.对于Kubernetes上的GPU,此配置将设置为nvidia.com或amd.com)
3.0.0
spark.resources.discoveryPlugin org.apache.spark.resource.ResourceDiscoveryScriptPlugin Comma-separated list of class names implementing org.apache.spark.api.resource.ResourceDiscoveryPlugin to load into the application. This is for advanced users to replace the resource discovery class with a custom implementation. Spark will try each class specified until one of them returns the resource information for that resource. It tries the discovery script last if none of the plugins return information for that resource.
要加载到应用程序中的实现org.apache.spark.API.resource.ResourceDiscoveryPlugin的类名的逗号分隔列表。高级用户可以使用自定义实现替换资源发现类。Spark将尝试指定的每个类,直到其中一个返回该资源的资源信息。如果没有插件返回该资源的信息,它将最后尝试发现脚本。
3.0.0
spark.executor.memory 1g Amount of memory to use per executor process, in the same format as JVM memory strings with a size unit suffix ("k", "m", "g" or "t") (e.g. 512m2g).
每个执行器进程使用的内存量,格式与JVM内存字符串相同,带有大小单位后缀(“k”、“m”、“g”或“t”)(例如 512m 、 2g )。
0.7.0
spark.executor.pyspark.memory Not set 未设置 The amount of memory to be allocated to PySpark in each executor, in MiB unless otherwise specified. If set, PySpark memory for an executor will be limited to this amount. If not set, Spark will not limit Python's memory use and it is up to the application to avoid exceeding the overhead memory space shared with other non-JVM processes. When PySpark is run in YARN or Kubernetes, this memory is added to executor resource requests.
在每个执行器中分配给PySpark的内存量,除非另有说明,否则以MiB为单位。如果设置了这个值,那么PySpark执行程序的内存将被限制在这个值内。如果没有设置,Spark将不会限制Python的内存使用,这取决于应用程序,以避免超过与其他非JVM进程共享的内存空间。当PySpark在YARN或Kubernetes中运行时,此内存将添加到执行器资源请求中。
Note: This feature is dependent on Python's `resource` module; therefore, the behaviors and limitations are inherited. For instance, Windows does not support resource limiting and actual resource is not limited on MacOS.
注意:此功能依赖于Python的“resource”模块;因此,行为和限制是继承的。例如,Windows不支持资源限制,实际资源在MacOS上不受限制。
2.4.0
spark.executor.memoryOverhead executorMemory * spark.executor.memoryOverheadFactor, with minimum of 384
executorMemory * spark.executor.memoryOverheadFactor ,最小384
Amount of additional memory to be allocated per executor process, in MiB unless otherwise specified. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%). This option is currently supported on YARN and Kubernetes.
每个执行器进程分配的额外内存量,除非另有指定,否则以MiB为单位。这是内存,用于处理VM开销、内部字符串、其他本机开销等。这往往会随着执行器的大小而增长(通常为6-10%)。此选项目前在YARN和Kubernetes上受支持。
Note: Additional memory includes PySpark executor memory (when spark.executor.pyspark.memory is not configured) and memory used by other non-executor processes running in the same container. The maximum memory size of container to running executor is determined by the sum of spark.executor.memoryOverheadspark.executor.memoryspark.memory.offHeap.size and spark.executor.pyspark.memory.
注意事项:额外的内存包括PySpark执行器内存(当没有配置 spark.executor.pyspark.memory 时)和在同一容器中运行的其他非执行器进程使用的内存。容器到运行执行器的最大内存大小由 spark.executor.memoryOverhead 、 spark.executor.memory 、 spark.memory.offHeap.size 和 spark.executor.pyspark.memory 之和决定。
2.3.0
spark.executor.memoryOverheadFactor 0.10 Fraction of executor memory to be allocated as additional non-heap memory per executor process. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the container size. This value defaults to 0.10 except for Kubernetes non-JVM jobs, which defaults to 0.40. This is done as non-JVM tasks need more non-JVM heap space and such tasks commonly fail with "Memory Overhead Exceeded" errors. This preempts this error with a higher default. This value is ignored if spark.executor.memoryOverhead is set directly.
每个执行程序进程要分配的作为额外非堆内存的执行程序内存部分。这是内存,用于处理VM开销、内部字符串、其他本机开销等。这往往会随着容器大小的增加而增加。此值默认为0.10,但Kubernetes非JVM作业除外,默认值为0.40。这样做是因为非JVM任务需要更多的非JVM堆空间,并且此类任务通常会因“内存开销超出”错误而失败。这将使用更高的默认值来抢占此错误。如果直接设置 spark.executor.memoryOverhead ,则忽略此值。
3.3.0
spark.executor.resource.{resourceName}.amount 0 Amount of a particular resource type to use per executor process. If this is used, you must also specify the spark.executor.resource.{resourceName}.discoveryScript for the executor to find the resource on startup.
每个执行器进程使用的特定资源类型的数量。如果使用此选项,则还必须指定 spark.executor.resource.{resourceName}.discoveryScript ,以便执行程序在启动时查找资源。
3.0.0
spark.executor.resource.{resourceName}.discoveryScript None 没有一 A script for the executor to run to discover a particular resource type. This should write to STDOUT a JSON string in the format of the ResourceInformation class. This has a name and an array of addresses.
执行程序运行以发现特定资源类型的脚本。这应该以ResourceInformation类的格式向STDOUT写入JSON字符串。它有一个名称和一个地址数组。
3.0.0
spark.executor.resource.{resourceName}.vendor None 没有一 Vendor of the resources to use for the executors. This option is currently only supported on Kubernetes and is actually both the vendor and domain following the Kubernetes device plugin naming convention. (e.g. For GPUs on Kubernetes this config would be set to nvidia.com or amd.com)
要用于执行器的资源的供应商。此选项目前仅在Kubernetes上受支持,实际上是遵循Kubernetes设备插件命名约定的供应商和域。(e.g.对于Kubernetes上的GPU,此配置将设置为nvidia.com或amd.com)
3.0.0
spark.extraListeners (none) (无) A comma-separated list of classes that implement SparkListener; when initializing SparkContext, instances of these classes will be created and registered with Spark's listener bus. If a class has a single-argument constructor that accepts a SparkConf, that constructor will be called; otherwise, a zero-argument constructor will be called. If no valid constructor can be found, the SparkContext creation will fail with an exception.
一个逗号分隔的类列表,在初始化SparkContext时实现 SparkListener ;,这些类的实例将被创建并注册到Spark的侦听器总线。如果一个类有一个接受SparkConf的单参数构造函数,那么将调用该构造函数;否则,将调用零参数构造函数。如果找不到有效的构造函数,SparkContext创建将失败并出现异常。
1.3.0
spark.local.dir /tmp /临时 Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
Spark中用于“暂存”空间的目录,包括存储在磁盘上的map输出文件和RDD。这应该是在一个快速,在您的系统本地磁盘。它也可以是不同磁盘上的多个目录的逗号分隔列表。
Note: This will be overridden by SPARK_LOCAL_DIRS (Standalone), MESOS_SANDBOX (Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
注意事项:这将被集群管理器设置的 SPARK_LOCAL_DIRS(Standalone)、MESOS_SANDBOX(Mesos)或 LOCAL_DIRS(YARN)环境变量覆盖。
0.5.0
spark.logConf false 假 Logs the effective SparkConf as INFO when a SparkContext is started.
当SparkContext启动时,以INFO级别记录生效的SparkConf。
0.9.0
spark.master (none) (无) The cluster manager to connect to. See the list of allowed master URL's.
要连接到的群集管理器。查看允许的主URL列表。
0.9.0
spark.submit.deployMode (none) (无) The deploy mode of Spark driver program, either "client" or "cluster", Which means to launch driver program locally ("client") or remotely ("cluster") on one of the nodes inside the cluster.
Spark驱动程序的部署模式,“客户端”或“集群”,这意味着在集群内的一个节点上本地(“客户端”)或远程(“集群”)启动驱动程序。
1.5.0
spark.log.callerContext (none) (无) Application information that will be written into Yarn RM log/HDFS audit log when running on Yarn/HDFS. Its length depends on the Hadoop configuration hadoop.caller.context.max.size. It should be concise, and typically can have up to 50 characters.
在Yarn/HDFS上运行时将写入Yarn RM日志/HDFS审计日志的应用程序信息。它的长度取决于Hadoop配置 hadoop.caller.context.max.size 。它应该简洁,通常最多可以有50个字符。
2.2.0
spark.log.level (none) (无) When set, overrides any user-defined log settings as if calling SparkContext.setLogLevel() at Spark startup. Valid log levels include: "ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN".
设置后,覆盖任何用户定义的日志设置,就像在Spark启动时调用 SparkContext.setLogLevel() 一样。有效的日志级别包括:“所有”、“调试”、“错误”、“致命”、“信息”、“关闭”、“跟踪”、“警告”。
3.5.0
spark.driver.supervise false 假 If true, restarts the driver automatically if it fails with a non-zero exit status. Only has effect in Spark standalone mode or Mesos cluster deploy mode.
如果为true,则在驱动程序以非零退出状态失败时自动重新启动驱动程序。仅在Spark独立模式或Mesos集群部署模式下有效。
1.3.0
spark.driver.log.dfsDir (none) (无) Base directory in which Spark driver logs are synced, if spark.driver.log.persistToDfs.enabled is true. Within this base directory, each application logs the driver logs to an application specific file. Users may want to set this to a unified location like an HDFS directory so driver log files can be persisted for later usage. This directory should allow any Spark user to read/write files and the Spark History Server user to delete files. Additionally, older logs from this directory are cleaned by the Spark History Server if spark.history.fs.driverlog.cleaner.enabled is true and, if they are older than max age configured by setting spark.history.fs.driverlog.cleaner.maxAge.
如果 spark.driver.log.persistToDfs.enabled 为true,则同步Spark驱动程序日志的基本目录。在此基本目录中,每个应用程序将驱动程序日志记录到应用程序特定的文件中。用户可能希望将其设置为一个统一的位置,如HDFS目录,以便驱动程序日志文件可以持久化以供以后使用。这个目录应该允许任何Spark用户读/写文件,允许Spark历史服务器用户删除文件。此外,如果 spark.history.fs.driverlog.cleaner.enabled 为true,并且如果它们比设置 spark.history.fs.driverlog.cleaner.maxAge 配置的最大年龄更早,则Spark历史服务器会清理此目录中的旧日志。
3.0.0
spark.driver.log.persistToDfs.enabled false 假 If true, spark application running in client mode will write driver logs to a persistent storage, configured in spark.driver.log.dfsDir. If spark.driver.log.dfsDir is not configured, driver logs will not be persisted. Additionally, enable the cleaner by setting spark.history.fs.driverlog.cleaner.enabled to true in Spark History Server.
如果为true,则在客户端模式下运行的spark应用程序将驱动程序日志写入持久存储,配置在 spark.driver.log.dfsDir 中。如果未配置 spark.driver.log.dfsDir ,则驱动程序日志将不会持久化。此外,通过在Spark历史服务器中将 spark.history.fs.driverlog.cleaner.enabled 设置为true来启用清理器。
3.0.0
spark.driver.log.layout %d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n%ex
%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}:%m%n%ex
The layout for the driver logs that are synced to spark.driver.log.dfsDir. If this is not configured, it uses the layout for the first appender defined in log4j2.properties. If that is also not configured, driver logs use the default layout.
同步到 spark.driver.log.dfsDir 的驱动程序日志的布局。如果没有配置,它将使用log4j2.properties中定义的第一个appender的布局。如果也未配置,驱动程序日志将使用默认布局。
3.0.0
spark.driver.log.allowErasureCoding false 假 Whether to allow driver logs to use erasure coding. On HDFS, erasure coded files will not update as quickly as regular replicated files, so they make take longer to reflect changes written by the application. Note that even if this is true, Spark will still not force the file to use erasure coding, it will simply use file system defaults.
是否允许驱动程序日志使用擦除编码。在HDFS上,擦除编码文件不会像常规复制文件那样快速更新,因此它们需要更长的时间来反映应用程序写入的更改。请注意,即使这是真的,Spark仍然不会强制文件使用擦除编码,它只会使用文件系统默认值。
3.0.0
spark.decommission.enabled false 假 When decommission enabled, Spark will try its best to shut down the executor gracefully. Spark will try to migrate all the RDD blocks (controlled by spark.storage.decommission.rddBlocks.enabled) and shuffle blocks (controlled by spark.storage.decommission.shuffleBlocks.enabled) from the decommissioning executor to a remote executor when spark.storage.decommission.enabled is enabled. With decommission enabled, Spark will also decommission an executor instead of killing when spark.dynamicAllocation.enabled enabled.
当停用启用时,Spark会尽最大努力优雅地关闭执行器。当启用 spark.storage.decommission.enabled 时,Spark将尝试将所有RDD块(由 spark.storage.decommission.rddBlocks.enabled 控制)和shuffle块(由 spark.storage.decommission.shuffleBlocks.enabled 控制)从退役执行器迁移到远程执行器。启用停用后,Spark也会停用执行器,而不是在启用 spark.dynamicAllocation.enabled 时杀死执行器。
3.1.0
spark.executor.decommission.killInterval (none) (无) Duration after which a decommissioned executor will be killed forcefully by an outside (e.g. non-spark) service.
经过这段时间之后,已退役的执行器将被外部(例如非Spark)服务强制杀死。
3.1.0
spark.executor.decommission.forceKillTimeout (none) (无) Duration after which a Spark will force a decommissioning executor to exit. This should be set to a high value in most situations as low values will prevent block migrations from having enough time to complete.
Spark将强制停用执行器退出的持续时间。在大多数情况下,应将此值设置为较高的值,因为较低的值将使数据块迁移无法有足够的时间完成。
3.2.0
spark.executor.decommission.signal PWR The signal that used to trigger the executor to start decommission.
用来触发执行者开始解除任务的信号。
3.2.0
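To tie a few of the application properties above together, here is a hedged driver-side sketch; the values and the local directories are purely illustrative, and spark.driver.memory is deliberately left to spark-submit because it cannot be set from a running driver in client mode.

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("CountingSheep")                      // spark.app.name
    .set("spark.driver.maxResultSize", "2g")          // cap serialized results of collect()
    .set("spark.local.dir", "/data1/tmp,/data2/tmp")  // hypothetical fast local disks for scratch space
    .set("spark.logConf", "true")                     // log the effective SparkConf at startup
    .set("spark.executor.memory", "4g")

  val sc = new SparkContext(conf)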

Apart from these, the following properties are also available, and may be useful in some situations:
除此之外,还提供了以下属性,并且在某些情况下可能有用:

Runtime Environment 运行时环境 

Property Name 属性名称 Default 默认 Meaning 意义 Since Version 从版本
spark.driver.extraClassPath (none) (无) Extra classpath entries to prepend to the classpath of the driver.
额外的类路径条目,以前置到驱动程序的类路径。
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
注意事项:在客户端模式下,不能直接在应用程序中通过 SparkConf 设置此配置,因为驱动程序JVM已经在该点启动。相反,请通过 --driver-class-path 命令行选项或在默认属性文件中进行设置。
1.0.0
spark.driver.defaultJavaOptions (none) (无) A string of default JVM options to prepend to spark.driver.extraJavaOptions. This is intended to be set by administrators. For instance, GC settings or other logging. Note that it is illegal to set maximum heap size (-Xmx) settings with this option. Maximum heap size settings can be set with spark.driver.memory in the cluster mode and through the --driver-memory command line option in the client mode.
一个默认JVM选项的字符串,用于前缀到 spark.driver.extraJavaOptions 。这是由管理员设置的。例如,GC设置或其他日志记录。请注意,使用此选项设置最大堆大小(-Xmx)是非法的。最大堆大小设置可以在集群模式下使用 spark.driver.memory 设置,在客户端模式下通过 --driver-memory 命令行选项设置。
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-java-options command line option or in your default properties file.
注意事项:在客户端模式下,不能直接在应用程序中通过 SparkConf 设置此配置,因为驱动程序JVM已经在该点启动。相反,请通过 --driver-java-options 命令行选项或在默认属性文件中进行设置。
3.0.0
spark.driver.extraJavaOptions (none) (无) A string of extra JVM options to pass to the driver. This is intended to be set by users. For instance, GC settings or other logging. Note that it is illegal to set maximum heap size (-Xmx) settings with this option. Maximum heap size settings can be set with spark.driver.memory in the cluster mode and through the --driver-memory command line option in the client mode.
要传递给驱动程序的额外JVM选项的字符串。这是由用户设置的。例如,GC设置或其他日志记录。请注意,使用此选项设置最大堆大小(-Xmx)是非法的。最大堆大小设置可以在集群模式下使用 spark.driver.memory 设置,在客户端模式下通过 --driver-memory 命令行选项设置。
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-java-options command line option or in your default properties file. spark.driver.defaultJavaOptions will be prepended to this configuration.
注意事项:在客户端模式下,不能直接在应用程序中通过 SparkConf 设置此配置,因为驱动程序JVM已经在该点启动。相反,请通过 --driver-java-options 命令行选项或在默认属性文件中进行设置。 spark.driver.defaultJavaOptions 将被前置到此配置。
1.0.0
spark.driver.extraLibraryPath (none) (无) Set a special library path to use when launching the driver JVM.
设置启动驱动程序JVM时使用的特殊库路径。
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-library-path command line option or in your default properties file.
注意事项:在客户端模式下,不能直接在应用程序中通过 SparkConf 设置此配置,因为驱动程序JVM已经在该点启动。相反,请通过 --driver-library-path 命令行选项或在默认属性文件中进行设置。
1.0.0
spark.driver.userClassPathFirst false 假 (Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. This is used in cluster mode only.
(实验)在驱动程序中加载类时,是否给予用户添加的jar优先于Spark自己的jar。此功能可用于缓解Spark依赖项和用户依赖项之间的冲突。目前,这是一个实验性的功能。这仅在群集模式下使用。
1.3.0
spark.executor.extraClassPath (none) (无) Extra classpath entries to prepend to the classpath of executors. This exists primarily for backwards-compatibility with older versions of Spark. Users typically should not need to set this option.
额外的类路径条目,以作为executors类路径的前缀。这主要是为了向后兼容旧版本的Spark。用户通常不需要设置此选项。
1.0.0
spark.executor.defaultJavaOptions (none) (无) A string of default JVM options to prepend to spark.executor.extraJavaOptions. This is intended to be set by administrators. For instance, GC settings or other logging. Note that it is illegal to set Spark properties or maximum heap size (-Xmx) settings with this option. Spark properties should be set using a SparkConf object or the spark-defaults.conf file used with the spark-submit script. Maximum heap size settings can be set with spark.executor.memory. The following symbols, if present will be interpolated: {{APP_ID}} will be replaced by application ID and {{EXECUTOR_ID}} will be replaced by executor ID. For example, to enable verbose gc logging to a file named for the executor ID of the app in /tmp, pass a 'value' of: -verbose:gc -Xloggc:/tmp/{{APP_ID}}-{{EXECUTOR_ID}}.gc
一个默认JVM选项的字符串,用于前缀到 spark.executor.extraJavaOptions 。这是由管理员设置的。例如,GC设置或其他日志记录。请注意,使用此选项设置Spark属性或最大堆大小(-Xmx)设置是非法的。Spark属性应该使用SparkConf对象或spark-submit脚本中使用的spark-defaults.conf文件来设置。可以使用spark.executor.memory设置最大堆大小。以下符号(如果存在)将被插值: {{APP_ID}} 将被替换为应用程序ID, {{EXECUTOR_ID}} 将被替换为执行器ID。例如,要将详细的GC日志记录到 /tmp 中以应用程序和执行器ID命名的文件中,请传递如下"值": -verbose:gc -Xloggc:/tmp/{{APP_ID}}-{{EXECUTOR_ID}}.gc
3.0.0
spark.executor.extraJavaOptions (none) (无) A string of extra JVM options to pass to executors. This is intended to be set by users. For instance, GC settings or other logging. Note that it is illegal to set Spark properties or maximum heap size (-Xmx) settings with this option. Spark properties should be set using a SparkConf object or the spark-defaults.conf file used with the spark-submit script. Maximum heap size settings can be set with spark.executor.memory. The following symbols, if present will be interpolated: {{APP_ID}} will be replaced by application ID and {{EXECUTOR_ID}} will be replaced by executor ID. For example, to enable verbose gc logging to a file named for the executor ID of the app in /tmp, pass a 'value' of: -verbose:gc -Xloggc:/tmp/{{APP_ID}}-{{EXECUTOR_ID}}.gc spark.executor.defaultJavaOptions will be prepended to this configuration.
传递给执行器的额外JVM选项的字符串。这是由用户设置的。例如,GC设置或其他日志记录。请注意,使用此选项设置Spark属性或最大堆大小(-Xmx)设置是非法的。Spark属性应该使用SparkConf对象或spark-submit脚本中使用的spark-defaults.conf文件来设置。可以使用spark.executor.memory设置最大堆大小。以下符号(如果存在)将被插值: {{APP_ID}} 将被替换为应用程序ID, {{EXECUTOR_ID}} 将被替换为执行器ID。例如,要将详细的GC日志记录到 /tmp 中以应用程序和执行器ID命名的文件中,请传递如下"值": -verbose:gc -Xloggc:/tmp/{{APP_ID}}-{{EXECUTOR_ID}}.gc spark.executor.defaultJavaOptions 将被前置到此配置。
1.0.0
spark.executor.extraLibraryPath (none) (无) Set a special library path to use when launching executor JVM's.
设置启动执行器JVM时使用的特殊库路径。
1.0.0
spark.executor.logs.rolling.maxRetainedFiles (none) (无) Sets the number of latest rolling log files that are going to be retained by the system. Older log files will be deleted. Disabled by default.
设置系统将保留的最新滚动日志文件的数量。旧的日志文件将被删除。默认情况下禁用。
1.1.0
spark.executor.logs.rolling.enableCompression false 假 Enable executor log compression. If it is enabled, the rolled executor logs will be compressed. Disabled by default.
启用执行器日志压缩。如果启用,则会压缩滚动的执行器日志。默认情况下禁用。
2.0.2
spark.executor.logs.rolling.maxSize (none) (无) Set the max size of the file in bytes by which the executor logs will be rolled over. Rolling is disabled by default. See spark.executor.logs.rolling.maxRetainedFiles for automatic cleaning of old logs.
以字节为单位设置文件的最大大小,执行器日志将按此大小滚动。默认情况下禁用滚动。有关旧日志的自动清理,请参见 spark.executor.logs.rolling.maxRetainedFiles 。
1.4.0
spark.executor.logs.rolling.strategy (none) (无) Set the strategy of rolling of executor logs. By default it is disabled. It can be set to "time" (time-based rolling) or "size" (size-based rolling). For "time", use spark.executor.logs.rolling.time.interval to set the rolling interval. For "size", use spark.executor.logs.rolling.maxSize to set the maximum file size for rolling.
设置执行器日志的滚动策略。默认情况下,它是禁用的。它可以设置为“时间”(基于时间的滚动)或“大小”(基于大小的滚动)。对于“时间”,使用 spark.executor.logs.rolling.time.interval 设置滚动间隔。对于“size”,使用 spark.executor.logs.rolling.maxSize 设置滚动的最大文件大小。
1.1.0
spark.executor.logs.rolling.time.interval daily 每日 Set the time interval by which the executor logs will be rolled over. Rolling is disabled by default. Valid values are dailyhourlyminutely or any interval in seconds. See spark.executor.logs.rolling.maxRetainedFiles for automatic cleaning of old logs.
设置执行程序日志的滚动时间间隔。默认情况下禁用滚动。有效值为 daily 、 hourly 、 minutely 或任何以秒为单位的间隔。有关旧日志的自动清理,请参见 spark.executor.logs.rolling.maxRetainedFiles 。
1.1.0
spark.executor.userClassPathFirst false 假 (Experimental) Same functionality as spark.driver.userClassPathFirst, but applied to executor instances.
(实验)与 spark.driver.userClassPathFirst 相同的功能,但应用于执行器实例。
1.3.0
spark.executorEnv.[EnvironmentVariableName] (none) (无) Add the environment variable specified by EnvironmentVariableName to the Executor process. The user can specify multiple of these to set multiple environment variables.
将 EnvironmentVariableName 指定的环境变量添加到Executor进程。用户可以指定其中的多个来设置多个环境变量。
0.9.0
spark.redaction.regex (?i)secret|password|token|access[.]key
Regex to decide which Spark configuration properties and environment variables in driver and executor environments contain sensitive information. When this regex matches a property key or value, the value is redacted from the environment UI and various logs like YARN and event logs.
Regex决定驱动程序和执行器环境中的哪些Spark配置属性和环境变量包含敏感信息。当这个正则表达式匹配一个属性键或值时,该值将从环境UI和各种日志(如YARN和事件日志)中编辑。
2.1.2
spark.redaction.string.regex (none) (无) Regex to decide which parts of strings produced by Spark contain sensitive information. When this regex matches a string part, that string part is replaced by a dummy value. This is currently used to redact the output of SQL explain commands.
Regex决定Spark生成的字符串的哪些部分包含敏感信息。当这个正则表达式匹配一个字符串部分时,该字符串部分将被一个伪值替换。这目前用于编辑SQL解释命令的输出。
2.2.0
spark.python.profile false 假 Enable profiling in Python worker, the profile result will show up by sc.show_profiles(), or it will be displayed before the driver exits. It also can be dumped into disk by sc.dump_profiles(path). If some of the profile results had been displayed manually, they will not be displayed automatically before driver exiting. By default the pyspark.profiler.BasicProfiler will be used, but this can be overridden by passing a profiler class in as a parameter to the SparkContext constructor.
在Python worker中启用profiling,配置文件结果将显示在 sc.show_profiles() ,或者在驱动程序退出之前显示。它也可以通过 sc.dump_profiles(path) 转储到磁盘中。如果某些配置文件结果已手动显示,则在驱动程序退出之前不会自动显示。默认情况下,将使用 pyspark.profiler.BasicProfiler ,但这可以通过将分析器类作为参数传递给 SparkContext 构造函数来覆盖。
1.2.0
spark.python.profile.dump (none) (无) The directory which is used to dump the profile result before driver exiting. The results will be dumped as separated file for each RDD. They can be loaded by pstats.Stats(). If this is specified, the profile result will not be displayed automatically.
在驱动程序退出之前用于转储配置文件结果的目录。结果将作为每个RDD的单独文件转储。可以通过 pstats.Stats() 加载。如果指定此选项,则不会自动显示配置文件结果。
1.2.0
spark.python.worker.memory 512m Amount of memory to use per python worker process during aggregation, in the same format as JVM memory strings with a size unit suffix ("k", "m", "g" or "t") (e.g. 512m2g). If the memory used during aggregation goes above this amount, it will spill the data into disks.
聚合期间每个python工作进程使用的内存量,格式与JVM内存字符串相同,带有大小单位后缀(“k”、“m”、“g”或“t”)(例如 512m 、 2g )。如果聚合期间使用的内存超过此数量,则会将数据溢出到磁盘中。
1.1.0
spark.python.worker.reuse true 真 Reuse Python worker or not. If yes, it will use a fixed number of Python workers, does not need to fork() a Python process for every task. It will be very useful if there is a large broadcast, then the broadcast will not need to be transferred from JVM to Python worker for every task.
是否重用Python worker。如果是,它将使用固定数量的Python工作线程,不需要为每个任务都使用fork()Python进程。如果有一个大的广播,这将是非常有用的,那么广播将不需要从JVM传输到每个任务的Python worker。
1.2.0
spark.files Comma-separated list of files to be placed in the working directory of each executor. Globs are allowed.
要放置在每个执行器的工作目录中的文件的逗号分隔列表。允许使用Globs。
1.0.0
spark.submit.pyFiles Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps. Globs are allowed.
要放置在Python应用程序的PYTHONPATH上的.zip、.egg或.py文件的逗号分隔列表。允许使用Globs。
1.0.1
spark.jars Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed.
要包含在驱动程序和执行器类路径上的jar的逗号分隔列表。允许使用Globs。
0.9.0
spark.jars.packages Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. The coordinates should be groupId:artifactId:version. If spark.jars.ivySettings is given artifacts will be resolved according to the configuration in the file, otherwise artifacts will be searched for in the local maven repo, then maven central and finally any additional remote repositories given by the command-line option --repositories. For more details, see Advanced Dependency Management.
要包含在驱动程序和执行器类路径中的jar的Maven坐标的逗号分隔列表。坐标应为groupId:artifactId:version。如果给定 spark.jars.ivySettings ,则工件将根据该文件中的配置进行解析,否则将先在本地maven仓库中搜索工件,然后是maven central,最后是命令行选项 --repositories 给出的任何其他远程仓库。有关详细信息,请参阅高级依赖管理(Advanced Dependency Management)。
1.5.0
spark.jars.excludes Comma-separated list of groupId:artifactId, to exclude while resolving the dependencies provided in spark.jars.packages to avoid dependency conflicts.
groupId:artifactId的逗号分隔列表,用于在解析 spark.jars.packages 中提供的依赖时排除,以避免依赖冲突。
1.5.0
spark.jars.ivy Path to specify the Ivy user directory, used for the local Ivy cache and package files from spark.jars.packages. This will override the Ivy property ivy.default.ivy.user.dir which defaults to ~/.ivy2.
指定Ivy用户目录的路径,用于本地Ivy缓存和来自 spark.jars.packages 的包文件。这将覆盖Ivy属性 ivy.default.ivy.user.dir ,默认值为~/.ivy2。
1.3.0
spark.jars.ivySettings Path to an Ivy settings file to customize resolution of jars specified using spark.jars.packages instead of the built-in defaults, such as maven central. Additional repositories given by the command-line option --repositories or spark.jars.repositories will also be included. Useful for allowing Spark to resolve artifacts from behind a firewall e.g. via an in-house artifact server like Artifactory. Details on the settings file format can be found at Settings Files. Only paths with file:// scheme are supported. Paths without a scheme are assumed to have a file:// scheme.
Ivy设置文件的路径,用于自定义使用 spark.jars.packages 而不是内置默认值(如maven central)指定的jar的分辨率。命令行选项 --repositories 或 spark.jars.repositories 提供的其他存储库也将包括在内。用于允许Spark从防火墙后面解析工件,例如通过内部工件服务器(如Artifactory)。有关设置文件格式的详细信息,请参见设置文件。仅支持具有 file:// 方案的路径。假设没有方案的路径具有 file:// 方案。

When running in YARN cluster mode, this file will also be localized to the remote driver for dependency resolution within SparkContext#addJar
当在YARN集群模式下运行时,此文件也将本地化到远程驱动程序,以在 SparkContext#addJar 内进行依赖关系解析

2.2.0
spark.jars.repositories Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages or spark.jars.packages.
以逗号分隔的其他远程存储库列表,用于搜索使用 --packages 或 spark.jars.packages 给出的maven坐标。
2.3.0
spark.archives Comma-separated list of archives to be extracted into the working directory of each executor. .jar, .tar.gz, .tgz and .zip are supported. You can specify the directory name to unpack via adding # after the file name to unpack, for example, file.zip#directory. This configuration is experimental.
要提取到每个执行者的工作目录中的归档文件的逗号分隔列表。支持.jar、. tar.gz、.tgz和.zip。您可以通过在要解压缩的文件名后添加 # 来指定要解压缩的目录名,例如 file.zip#directory 。这种配置是实验性的。
3.1.0
spark.pyspark.driver.python Python binary executable to use for PySpark in driver. (default is spark.pyspark.python)
Python二进制可执行文件用于PySpark驱动程序。(默认为 spark.pyspark.python )
2.1.0
spark.pyspark.python Python binary executable to use for PySpark in both driver and executors.
Python二进制可执行文件,用于PySpark的驱动程序和执行器。
2.1.0
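As an illustration of the runtime-environment properties above, a small sketch; the environment variable and the Maven coordinate are hypothetical placeholders, not real artifacts.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("RuntimeEnvExample")
    // Environment variable made visible to every executor process.
    .set("spark.executorEnv.MY_SERVICE_URL", "http://example.internal:8080")
    // Extra executor JVM options (never -Xmx here; heap size comes from spark.executor.memory).
    .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -verbose:gc")
    // Resolve an additional dependency from Maven at submit time (hypothetical coordinate).
    .set("spark.jars.packages", "org.example:example-lib_2.12:1.0.0")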

Shuffle Behavior Shuffle行为

Property Name 属性名称 Default 默认 Meaning 意义 Since Version
spark.reducer.maxSizeInFlight 48m Maximum size of map outputs to fetch simultaneously from each reduce task, in MiB unless otherwise specified. Since each output requires us to create a buffer to receive it, this represents a fixed memory overhead per reduce task, so keep it small unless you have a large amount of memory.
从每个reduce任务同时获取的map输出的最大大小,除非另有说明,否则以MiB为单位。由于每个输出都需要我们创建一个缓冲区来接收它,这表示每个reduce任务的内存开销是固定的,所以保持它小,除非你有大量的内存。
1.4.0
spark.reducer.maxReqsInFlight Int.MaxValue This configuration limits the number of remote requests to fetch blocks at any given point. When the number of hosts in the cluster increase, it might lead to very large number of inbound connections to one or more nodes, causing the workers to fail under load. By allowing it to limit the number of fetch requests, this scenario can be mitigated.
此配置限制了在任何给定点获取块的远程请求的数量。当集群中的主机数量增加时,可能会导致到一个或多个节点的入站连接数量非常大,从而导致工作进程在负载下失败。通过允许它限制获取请求的数量,可以缓解这种情况。
2.0.0
spark.reducer.maxBlocksInFlightPerAddress Int.MaxValue This configuration limits the number of remote blocks being fetched per reduce task from a given host port. When a large number of blocks are being requested from a given address in a single fetch or simultaneously, this could crash the serving executor or Node Manager. This is especially useful to reduce the load on the Node Manager when external shuffle is enabled. You can mitigate this issue by setting it to a lower value.
此配置限制了每个reduce任务从给定主机端口获取的远程块的数量。当在一次或同时从给定地址请求大量块时,这可能会使服务执行器或节点管理器崩溃。这对于在启用外部随机播放时减少节点管理器上的负载特别有用。您可以通过将其设置为较低的值来缓解此问题。
2.2.1
spark.shuffle.compress true 真 Whether to compress map output files. Generally a good idea. Compression will use spark.io.compression.codec.
是否压缩map输出文件。通常这是个好主意。压缩将使用 spark.io.compression.codec 。
0.6.0
spark.shuffle.file.buffer 32k Size of the in-memory buffer for each shuffle file output stream, in KiB unless otherwise specified. These buffers reduce the number of disk seeks and system calls made in creating intermediate shuffle files.
每个随机文件输出流的内存缓冲区大小,除非另有指定,否则以KiB为单位。这些缓冲区减少了在创建中间随机文件时进行的磁盘寻道和系统调用的数量。
1.4.0
spark.shuffle.unsafe.file.output.buffer 32k The file system for this buffer size after each partition is written in unsafe shuffle writer. In KiB unless otherwise specified.
此缓冲区大小的文件系统后,每个分区都写入不安全的shuffle writer。除非另有说明,否则单位为KiB。
2.3.0
spark.shuffle.spill.diskWriteBufferSize 1024 * 1024 The buffer size, in bytes, to use when writing the sorted records to an on-disk file.
将已排序的记录写入磁盘文件时使用的缓冲区大小(以字节为单位)。
2.3.0
spark.shuffle.io.maxRetries 3 (Netty only) Fetches that fail due to IO-related exceptions are automatically retried if this is set to a non-zero value. This retry logic helps stabilize large shuffles in the face of long GC pauses or transient network connectivity issues.
(仅限Netty)如果将此设置为非零值,则会自动重试由于IO相关异常而失败的提取。这种重试逻辑有助于在长时间GC暂停或暂时网络连接问题面前稳定大的洗牌。
1.2.0
spark.shuffle.io.numConnectionsPerPeer 1 (Netty only) Connections between hosts are reused in order to reduce connection buildup for large clusters. For clusters with many hard disks and few hosts, this may result in insufficient concurrency to saturate all disks, and so users may consider increasing this value.
(仅限Netty)主机之间的连接被重用,以减少大型群集的连接积累。对于具有许多硬盘和很少主机的群集,这可能导致并发性不足,无法使所有磁盘饱和,因此用户可以考虑增加此值。
1.2.1
spark.shuffle.io.preferDirectBufs true 真 (Netty only) Off-heap buffers are used to reduce garbage collection during shuffle and cache block transfer. For environments where off-heap memory is tightly limited, users may wish to turn this off to force all allocations from Netty to be on-heap.
(仅限Netty)堆外缓冲区用于减少洗牌和缓存块传输期间的垃圾收集。对于堆外内存受到严格限制的环境,用户可能希望关闭此功能,以强制Netty的所有分配都在堆上。
1.2.0
spark.shuffle.io.retryWait 5s (Netty only) How long to wait between retries of fetches. The maximum delay caused by retrying is 15 seconds by default, calculated as maxRetries * retryWait.
(仅限Netty)重试提取之间的等待时间。重试造成的最大延迟默认为15秒,按 maxRetries * retryWait 计算。
1.2.1
spark.shuffle.io.backLog -1 Length of the accept queue for the shuffle service. For large applications, this value may need to be increased, so that incoming connections are not dropped if the service cannot keep up with a large number of connections arriving in a short period of time. This needs to be configured wherever the shuffle service itself is running, which may be outside of the application (see spark.shuffle.service.enabled option below). If set below 1, will fallback to OS default defined by Netty's io.netty.util.NetUtil#SOMAXCONN.
随机服务的接受队列长度。对于大型应用程序,可能需要增加此值,以便在服务无法跟上短时间内到达的大量连接时不会丢弃传入连接。这需要在shuffle服务本身运行的任何地方进行配置,这可能在应用程序之外(请参阅下面的 spark.shuffle.service.enabled 选项)。如果设置为1以下,将回退到由Netty的 io.netty.util.NetUtil#SOMAXCONN 定义的操作系统默认值。
1.1.1
spark.shuffle.io.connectionTimeout value of spark.network.timeout( spark.network.timeout 的值) Timeout for the established connections between shuffle servers and clients to be marked as idled and closed if there are still outstanding fetch requests but no traffic on the channel for at least `connectionTimeout`.
如果仍有未完成的拉取请求,但通道上至少在 `connectionTimeout` 时间内没有任何流量,则把shuffle服务器与客户端之间已建立的连接标记为空闲并关闭的超时时间。
1.2.0
spark.shuffle.io.connectionCreationTimeout value of spark.shuffle.io.connectionTimeout( spark.shuffle.io.connectionTimeout 的值) Timeout for establishing a connection between the shuffle servers and clients.
在shuffle服务器和客户端之间建立连接的超时时间。
3.2.0
spark.shuffle.service.enabled false 假 Enables the external shuffle service. This service preserves the shuffle files written by executors e.g. so that executors can be safely removed, or so that shuffle fetches can continue in the event of executor failure. The external shuffle service must be set up in order to enable it. See dynamic allocation configuration and setup documentation for more information.
启用外部Shuffle服务。此服务会保留执行器写入的shuffle文件,例如以便可以安全地移除执行器,或者在执行器失败时仍可继续进行shuffle数据拉取。必须先部署外部Shuffle服务才能启用它。有关详细信息,请参阅动态分配的配置和设置文档。
1.2.0
spark.shuffle.service.port 7337 Port on which the external shuffle service will run.
运行外部shuffle服务的端口。
1.2.0
spark.shuffle.service.name spark_shuffle The configured name of the Spark shuffle service the client should communicate with. This must match the name used to configure the Shuffle within the YARN NodeManager configuration (yarn.nodemanager.aux-services). Only takes effect when spark.shuffle.service.enabled is set to true.
客户端应该与之通信的Spark shuffle服务的配置名称。这必须与用于在YARN NodeManager配置( yarn.nodemanager.aux-services )中配置Shuffle的名称匹配。只有当 spark.shuffle.service.enabled 设置为true时才生效。
3.2.0
spark.shuffle.service.index.cache.size 100m Cache entries limited to the specified memory footprint, in bytes unless otherwise specified.
缓存条目限制为指定的内存占用,除非另有指定,否则以字节为单位。
2.3.0
spark.shuffle.service.removeShuffle false 假 Whether to use the ExternalShuffleService for deleting shuffle blocks for deallocated executors when the shuffle is no longer needed. Without this enabled, shuffle data on executors that are deallocated will remain on disk until the application ends.
当不再需要shuffle时,是否使用ExternalShuffleService删除已释放执行器的shuffle块。如果不启用此选项,则释放的执行器上的shuffle数据将保留在磁盘上,直到应用程序结束。
3.3.0
spark.shuffle.maxChunksBeingTransferred Long.MAX_VALUE 长最大值 The max number of chunks allowed to be transferred at the same time on shuffle service. Note that new incoming connections will be closed when the max number is hit. The client will retry according to the shuffle retry configs (see spark.shuffle.io.maxRetries and spark.shuffle.io.retryWait), if those limits are reached the task will fail with fetch failure.
在shuffle服务上允许同时传输的最大块数。请注意,当达到最大数量时,新的传入连接将被关闭。客户端将根据shuffle重试限制(参见 spark.shuffle.io.maxRetries 和 spark.shuffle.io.retryWait )重试,如果达到这些限制,则任务将失败并获取失败。
2.3.0
spark.shuffle.sort.bypassMergeThreshold 200 (Advanced) In the sort-based shuffle manager, avoid merge-sorting data if there is no map-side aggregation and there are at most this many reduce partitions.
(高级)在基于排序的shuffle管理器中,如果没有map端聚合,并且最多有这么多reduce分区,请避免合并排序数据。
1.1.1
spark.shuffle.sort.io.plugin.class org.apache.spark.shuffle.sort.io.LocalDiskShuffleDataIO Name of the class to use for shuffle IO.
用于shuffle IO的类的名称。
3.0.0
spark.shuffle.spill.compress true 真 Whether to compress data spilled during shuffles. Compression will use spark.io.compression.codec.
是否压缩洗牌时溢出的数据。压缩将使用 spark.io.compression.codec 。
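
Finally, a hedged sketch of how a few of the shuffle properties above might be tuned for a job with heavy shuffle traffic; treat the numbers as starting points for experimentation rather than recommendations.

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("ShuffleTuningExample")
    .set("spark.reducer.maxSizeInFlight", "96m")  // larger fetch buffers per reduce task
    .set("spark.shuffle.file.buffer", "64k")      // bigger in-memory buffer per shuffle output stream
    .set("spark.shuffle.io.maxRetries", "6")      // retry fetches more times on flaky networks
    .set("spark.shuffle.io.retryWait", "10s")     // wait longer between retries
    .set("spark.shuffle.compress", "true")        // compress map output files (the default)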