
Exploring TaskManager Memory Allocation in Flink on YARN Mode


A Question

We submitted a Flink on YARN job (per-job mode) with the following parameters.

    /opt/flink-1.9.0/bin/flink run \
      --detached \
      --jobmanager yarn-cluster \
      --yarnname "x.y.z" \
      --yarnjobManagerMemory 2048 \
      --yarntaskManagerMemory 4096 \
      --yarnslots 2 \
      --parallelism 20 \
      --class x.y.z \
      xyz-1.0.jar

The job started 10 TaskManagers (parallelism 20 with 2 slots each) and ran normally. Open the job's web UI, pick any TaskManager page, and look at its memory figures.

[Figure: memory overview of one TaskManager in the Flink web UI (195230-14e5af9a29dc1f75.png)]

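Although we set the TaskManager memory to 4 GB in the parameters, the web UI shows a JVM heap size of only 2.47 GB, plus an item called "Flink Managed Memory" at 1.78 GB. Monitoring YarnTaskExecutorRunner with VisualVM shows that its JVM memory arguments are set as follows: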

[Figure: JVM arguments of YarnTaskExecutorRunner as shown in VisualVM (195230-254da625d965591c.png)]

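Clearly, it is Xmx plus MaxDirectMemorySize that adds up to the TM memory size we set in the launch parameters (4 GB). So why is it configured this way, and what exactly is "Flink Managed Memory"? The rest of this article works through these questions.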

TaskManager Memory Layout

The layout is shown in the figure below.

[Figure: TaskManager memory layout (195230-1dec9b300b1f8e3f.png)]

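To reduce object overhead, Flink stores most objects in serialized form. The smallest unit of serialized storage is the MemorySegment, which is backed by a byte array; its size is set by the taskmanager.memory.segment-size parameter and defaults to 32 KB. The individual memory regions are described below: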

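  • Network buffers: memory used for network transfer and network-related operations (shuffle, broadcast, and so on), made up of MemorySegments. Since Flink 1.5, network buffers are always allocated off-heap, which makes it possible to take full advantage of techniques such as zero-copy. The three related parameters, with the values we use, are:

    # Fraction of TM memory used for network buffers (default 0.1)
    taskmanager.network.memory.fraction: 0.15
    # Minimum and maximum size of the network buffers (defaults 64 MB and 1 GB)
    taskmanager.network.memory.min: 128mb
    taskmanager.network.memory.max: 1gb
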
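  • Managed memory (Flink Managed Memory): memory used by all of Flink's internal operator logic and for storing intermediate data. It also consists of MemorySegments and is managed by Flink's MemoryManager component. It is allocated on-heap by default; when the off-heap switch is turned on, it is allocated off-heap instead (the code shown later selects exactly one MemoryType). The two related parameters are:

    # Fraction of the TM heap used as on-heap managed memory (default 0.7)
    taskmanager.memory.fraction: 0.7
    # Whether managed memory is allocated off-heap (default false)
    taskmanager.memory.off-heap: false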

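This also shows that, unlike Spark, Flink's memory management does not distinguish between Storage and Execution memory. The two are merged into a single pool, which is more flexible.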

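  • Free memory: despite the name, this is where user code and user data structures live. It is always on-heap and can be understood as whatever remains of the heap after managed memory is taken out.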

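To find out where the sizes of the memory regions in the opening question come from, the best approach is to read the source code. The following walkthrough uses the Flink 1.9.0 source.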

TaskManager Memory Allocation Logic

The startup entry point of a YARN per-job cluster is in the o.a.f.yarn.YarnClusterDescriptor class.

    public ClusterClient<ApplicationId> deployJobCluster(
            ClusterSpecification clusterSpecification,
            JobGraph jobGraph,
            boolean detached) throws ClusterDeploymentException {
        // this is required because the slots are allocated lazily
        jobGraph.setAllowQueuedScheduling(true);
        try {
            return deployInternal(
                clusterSpecification,
                "Flink per-job cluster",
                getYarnJobClusterEntrypoint(),
                jobGraph,
                detached);
        } catch (Exception e) {
            throw new ClusterDeploymentException("Could not deploy Yarn job cluster.", e);
        }
    }

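Here the ClusterSpecification object holds the cluster's four basic parameters: JobManager memory size, TaskManager memory size, number of TaskManagers, and number of slots per TaskManager. At its very beginning, deployInternal() calls validateClusterSpecification() of the abstract class o.a.f.yarn.AbstractYarnClusterDescriptor to check that the ClusterSpecification is valid.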

    private void validateClusterSpecification(ClusterSpecification clusterSpecification) throws FlinkException {
        try {
            final long taskManagerMemorySize = clusterSpecification.getTaskManagerMemoryMB();
            // We do the validation by calling the calculation methods here
            // Internally these methods will check whether the cluster can be started with the provided
            // ClusterSpecification and the configured memory requirements
            final long cutoff = ContaineredTaskManagerParameters.calculateCutoffMB(flinkConfiguration, taskManagerMemorySize);
            TaskManagerServices.calculateHeapSizeMB(taskManagerMemorySize - cutoff, flinkConfiguration);
        } catch (IllegalArgumentException iae) {
            throw new FlinkException("Cannot fulfill the minimum memory requirements with the provided " +
                "cluster specification. Please increase the memory of the cluster.", iae);
        }
    }

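ClusterSpecification.getTaskManagerMemoryMB() returns the memory set with the -ytm/--yarntaskManagerMemory option; inside the Flink code this ultimately shows up as the value of the taskmanager.heap.size configuration option.

The first call is to ContaineredTaskManagerParameters.calculateCutoffMB(), which computes how much memory a YARN container hosting a TM must reserve for use by anything other than the TM.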

    public static long calculateCutoffMB(Configuration config, long containerMemoryMB) {
        Preconditions.checkArgument(containerMemoryMB > 0);
        // (1) check cutoff ratio
        final float memoryCutoffRatio = config.getFloat(
            ResourceManagerOptions.CONTAINERIZED_HEAP_CUTOFF_RATIO);
        if (memoryCutoffRatio >= 1 || memoryCutoffRatio <= 0) {
            throw new IllegalArgumentException("The configuration value '"
                + ResourceManagerOptions.CONTAINERIZED_HEAP_CUTOFF_RATIO.key() + "' must be between 0 and 1. Value given="
                + memoryCutoffRatio);
        }
        // (2) check min cutoff value
        final int minCutoff = config.getInteger(
            ResourceManagerOptions.CONTAINERIZED_HEAP_CUTOFF_MIN);
        if (minCutoff >= containerMemoryMB) {
            throw new IllegalArgumentException("The configuration value '"
                + ResourceManagerOptions.CONTAINERIZED_HEAP_CUTOFF_MIN.key() + "'='" + minCutoff
                + "' is larger than the total container memory " + containerMemoryMB);
        }
        // (3) check between heap and off-heap
        long cutoff = (long) (containerMemoryMB * memoryCutoffRatio);
        if (cutoff < minCutoff) {
            cutoff = minCutoff;
        }
        return cutoff;
    }

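The method proceeds as follows:

  1. Read the containerized.heap-cutoff-ratio parameter, the ratio of the memory the container reserves for non-TM use to the configured TM memory; the default is 0.25;
  2. Read the containerized.heap-cutoff-min parameter, the minimum amount of memory the container reserves for non-TM use; the default is 600 MB;
  3. Compute the reserved memory from the ratio and make sure the result is not below the minimum.

This shows that, for Flink on YARN, the TM memory we set is actually the memory of the container. In other words, the total memory a TM can use (on-heap plus off-heap) is: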

tm_total_memory = taskmanager.heap.size - max[containerized.heap-cutoff-min, taskmanager.heap.size * containerized.heap-cutoff-ratio]

Plugging in the parameters from the beginning of the article:

tm_total_memory = 4096 - max[600, 4096 * 0.25] = 3072
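
The cutoff arithmetic above can be reproduced with a small standalone sketch. It is not Flink's code: the class and method names are made up for illustration, plain arguments stand in for the Configuration lookups, and double is used where Flink uses float.

```java
/** Illustrative sketch of the cutoff arithmetic; class and method names are hypothetical. */
final class ContainerCutoffSketch {

    /** Memory (MB) the container sets aside for non-TM use: max(minCutoff, container * ratio). */
    static long cutoffMb(long containerMemoryMb, double cutoffRatio, long minCutoffMb) {
        if (containerMemoryMb <= 0) {
            throw new IllegalArgumentException("container memory must be positive");
        }
        if (cutoffRatio <= 0 || cutoffRatio >= 1) {
            throw new IllegalArgumentException("cutoff ratio must be between 0 and 1");
        }
        if (minCutoffMb >= containerMemoryMb) {
            throw new IllegalArgumentException("min cutoff must be smaller than container memory");
        }
        return Math.max(minCutoffMb, (long) (containerMemoryMb * cutoffRatio));
    }

    /** Memory (MB) left for the TaskManager JVM, heap and off-heap together. */
    static long tmTotalMemoryMb(long containerMemoryMb, double cutoffRatio, long minCutoffMb) {
        return containerMemoryMb - cutoffMb(containerMemoryMb, cutoffRatio, minCutoffMb);
    }

    public static void main(String[] args) {
        // Values from this article: 4096 MB container, default ratio 0.25, default minimum 600 MB.
        System.out.println(tmTotalMemoryMb(4096, 0.25, 600)); // 3072
        // Assumed smaller container: the 600 MB minimum wins over 2048 * 0.25 = 512.
        System.out.println(tmTotalMemoryMb(2048, 0.25, 600)); // 1448
    }
}
```

The second call illustrates why the minimum matters: for small containers the fixed 600 MB cutoff takes a much larger share than 25%.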

Next comes the TaskManagerServices.calculateHeapSizeMB() method.

    public static long calculateHeapSizeMB(long totalJavaMemorySizeMB, Configuration config) {
        Preconditions.checkArgument(totalJavaMemorySizeMB > 0);
        // all values below here are in bytes
        final long totalProcessMemory = megabytesToBytes(totalJavaMemorySizeMB);
        final long networkReservedMemory = getReservedNetworkMemory(config, totalProcessMemory);
        final long heapAndManagedMemory = totalProcessMemory - networkReservedMemory;
        if (config.getBoolean(TaskManagerOptions.MEMORY_OFF_HEAP)) {
            final long managedMemorySize = getManagedMemoryFromHeapAndManaged(config, heapAndManagedMemory);
            ConfigurationParserUtils.checkConfigParameter(managedMemorySize < heapAndManagedMemory, managedMemorySize,
                TaskManagerOptions.MANAGED_MEMORY_SIZE.key(),
                "Managed memory size too large for " + (networkReservedMemory >> 20) +
                    " MB network buffer memory and a total of " + totalJavaMemorySizeMB +
                    " MB JVM memory");
            return bytesToMegabytes(heapAndManagedMemory - managedMemorySize);
        }
        else {
            return bytesToMegabytes(heapAndManagedMemory);
        }
    }

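To keep things simple and in line with how we actually run our jobs, the case where off-heap managed memory is enabled is not considered here. This method involves the computation of the network buffer size, done by NettyShuffleEnvironmentConfiguration.calculateNetworkBufferMemory().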

    public static long calculateNetworkBufferMemory(long totalJavaMemorySize, Configuration config) {
        final int segmentSize = ConfigurationParserUtils.getPageSize(config);
        final long networkBufBytes;
        if (hasNewNetworkConfig(config)) {
            float networkBufFraction = config.getFloat(NettyShuffleEnvironmentOptions.NETWORK_BUFFERS_MEMORY_FRACTION);
            long networkBufSize = (long) (totalJavaMemorySize * networkBufFraction);
            networkBufBytes = calculateNewNetworkBufferMemory(config, networkBufSize, totalJavaMemorySize);
        } else {
            // use old (deprecated) network buffers parameter
            // legacy logic, omitted in this article
        }
        return networkBufBytes;
    }

    private static long calculateNewNetworkBufferMemory(Configuration config, long networkBufSize, long maxJvmHeapMemory) {
        float networkBufFraction = config.getFloat(NettyShuffleEnvironmentOptions.NETWORK_BUFFERS_MEMORY_FRACTION);
        long networkBufMin = MemorySize.parse(config.getString(NettyShuffleEnvironmentOptions.NETWORK_BUFFERS_MEMORY_MIN)).getBytes();
        long networkBufMax = MemorySize.parse(config.getString(NettyShuffleEnvironmentOptions.NETWORK_BUFFERS_MEMORY_MAX)).getBytes();
        int pageSize = ConfigurationParserUtils.getPageSize(config);
        checkNewNetworkConfig(pageSize, networkBufFraction, networkBufMin, networkBufMax);
        long networkBufBytes = Math.min(networkBufMax, Math.max(networkBufMin, networkBufSize));
        ConfigurationParserUtils.checkConfigParameter(/*...*/);
        return networkBufBytes;
    }

From this, the size of the network buffers is determined as follows:

network_buffer_memory = min[taskmanager.network.memory.max, max(taskmanager.network.memory.min, tm_total_memory * taskmanager.network.memory.fraction)]

Substituting the values:

network_buffer_memory = min[1024, max(128, 3072 * 0.15)] = 460.8

In other words, the heap memory the TM actually uses is:

tm_heap_memory = tm_total_memory - network_buffer_memory = 3072 - 460.8 ≈ 2611

This matches the -Xms/-Xmx setting in the VisualVM screenshot exactly.
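
As a cross-check, the clamping and the subtraction can be sketched in a few lines of standalone Java. This is only an illustration under stated assumptions: the class and method names are hypothetical, the configuration values are passed in as plain arguments, and double replaces the float used in Flink, so rounding in the last byte may differ from the real implementation.

```java
/** Illustrative sketch of the network buffer and heap arithmetic; names are hypothetical. */
final class NetworkBufferSketch {

    static final long MB = 1024L * 1024L;

    /** Network buffer memory in bytes: the fraction of the total, clamped to [min, max]. */
    static long networkBufferBytes(long tmTotalBytes, double fraction, long minBytes, long maxBytes) {
        long byFraction = (long) (tmTotalBytes * fraction);
        return Math.min(maxBytes, Math.max(minBytes, byFraction));
    }

    /** Heap size in MB when managed memory stays on-heap: total minus network buffers. */
    static long heapMb(long tmTotalMb, double fraction, long minMb, long maxMb) {
        long totalBytes = tmTotalMb * MB;
        long networkBytes = networkBufferBytes(totalBytes, fraction, minMb * MB, maxMb * MB);
        // Integer division rounds down to whole megabytes.
        return (totalBytes - networkBytes) / MB;
    }

    public static void main(String[] args) {
        // Values from this article: 3072 MB total, fraction 0.15, min 128 MB, max 1024 MB.
        System.out.println(heapMb(3072, 0.15, 128, 1024)); // 2611
    }
}
```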

In the same way, we can look at the MemorySegment count for network buffers in the TaskManager UI.

[Figure: network buffer MemorySegment count in the TaskManager UI (195230-23ee61466807899d.png)]

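Multiplying the segment count by the segment size shows that the actual network buffer memory is very close to the network_buffer_memory value computed above.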
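
The conversion between a byte budget and a segment count is simple enough to sketch. The code below assumes the segment count is the byte budget divided by the segment size, rounded down; the class and method names are hypothetical, and the count in the actual screenshot is not reproduced here.

```java
/** Illustrative conversion between network buffer memory and MemorySegment counts. */
final class SegmentCountSketch {

    /** Number of whole segments that fit into the given amount of memory. */
    static long segmentCount(long memoryBytes, int segmentSizeBytes) {
        return memoryBytes / segmentSizeBytes;
    }

    /** Memory in MB occupied by the given number of segments. */
    static double segmentsToMb(long segments, int segmentSizeBytes) {
        return segments * (double) segmentSizeBytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        int segmentSize = 32 * 1024;                        // default taskmanager.memory.segment-size
        long networkBytes = (long) (3072L * 1024 * 1024 * 0.15); // budget computed in this article
        long segments = segmentCount(networkBytes, segmentSize);
        System.out.println(segments);                            // 14745
        System.out.println(segmentsToMb(segments, segmentSize)); // 460.78125
    }
}
```

Because only whole segments are allocated, the memory actually occupied comes out slightly below the 460.8 MB budget.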

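So how is the size of on-heap managed memory computed? As mentioned earlier, managed memory is handled by the MemoryManager, so let's look at TaskManagerServices.createMemoryManager(), which initializes a MemoryManager from the configured parameters.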

    private static MemoryManager createMemoryManager(
            TaskManagerServicesConfiguration taskManagerServicesConfiguration) throws Exception {
        long configuredMemory = taskManagerServicesConfiguration.getConfiguredMemory();
        MemoryType memType = taskManagerServicesConfiguration.getMemoryType();
        final long memorySize;
        boolean preAllocateMemory = taskManagerServicesConfiguration.isPreAllocateMemory();
        if (configuredMemory > 0) {
            if (preAllocateMemory) {
                LOG.info(/*...*/);
            } else {
                LOG.info(/*...*/);
            }
            memorySize = configuredMemory << 20; // megabytes to bytes
        } else {
            // similar to #calculateNetworkBufferMemory(TaskManagerServicesConfiguration tmConfig)
            float memoryFraction = taskManagerServicesConfiguration.getMemoryFraction();
            if (memType == MemoryType.HEAP) {
                long freeHeapMemoryWithDefrag = taskManagerServicesConfiguration.getFreeHeapMemoryWithDefrag();
                // network buffers allocated off-heap -> use memoryFraction of the available heap:
                long relativeMemSize = (long) (freeHeapMemoryWithDefrag * memoryFraction);
                if (preAllocateMemory) {
                    LOG.info(/*...*/);
                } else {
                    LOG.info(/*...*/);
                }
                memorySize = relativeMemSize;
            } else if (memType == MemoryType.OFF_HEAP) {
                long maxJvmHeapMemory = taskManagerServicesConfiguration.getMaxJvmHeapMemory();
                // The maximum heap memory has been adjusted according to the fraction (see
                // calculateHeapSizeMB(long totalJavaMemorySizeMB, Configuration config)), i.e.
                // maxJvmHeap = jvmTotalNoNet - jvmTotalNoNet * memoryFraction = jvmTotalNoNet * (1 - memoryFraction)
                // directMemorySize = jvmTotalNoNet * memoryFraction
                long directMemorySize = (long) (maxJvmHeapMemory / (1.0 - memoryFraction) * memoryFraction);
                if (preAllocateMemory) {
                    LOG.info(/*...*/);
                } else {
                    LOG.info(/*...*/);
                }
                memorySize = directMemorySize;
            } else {
                throw new RuntimeException("No supported memory type detected.");
            }
        }
        // now start the memory manager
        final MemoryManager memoryManager;
        try {
            memoryManager = new MemoryManager(
                memorySize,
                taskManagerServicesConfiguration.getNumberOfSlots(),
                taskManagerServicesConfiguration.getPageSize(),
                memType,
                preAllocateMemory);
        } catch (OutOfMemoryError e) {
            // ...
        }
        return memoryManager;
    }

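In brief, the flow is:

  1. Read the taskmanager.memory.size parameter, which fixes the absolute size of managed memory;
  2. If taskmanager.memory.size is not set, fall back to the taskmanager.memory.fraction parameter mentioned earlier;
  3. Considering only the on-heap case, call TaskManagerServicesConfiguration.getFreeHeapMemoryWithDefrag(), which first triggers a GC and then obtains the amount of available heap memory. Barring surprises, the value returned at program initialization should therefore be the same as the -Xms/-Xmx setting discussed earlier;
  4. Compute the managed memory size and the other parameters, and return the MemoryManager instance.

In general we do not bluntly set taskmanager.memory.size to a fixed value. Therefore: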

flink_managed_memory = tm_heap_memory * taskmanager.memory.fraction = 2611 * 0.7 ≈ 1827

This is the managed memory size shown in the TaskManager UI (1827 MB is roughly 1.78 GB).
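
The on-heap branch of that decision can be condensed into a short sketch. The names below are hypothetical, values are kept in MB for readability (Flink works in bytes), the free heap is assumed to equal the 2611 MB computed earlier, and a non-positive configured size is treated as "not set".

```java
/** Illustrative sketch of how the on-heap managed memory size is chosen; names are hypothetical. */
final class ManagedMemorySketch {

    /**
     * Managed memory in MB: the explicitly configured size if one is given,
     * otherwise the given fraction of the free heap.
     */
    static long managedMemoryMb(long configuredMb, long freeHeapMb, double fraction) {
        if (configuredMb > 0) {
            return configuredMb;
        }
        return (long) (freeHeapMb * fraction);
    }

    public static void main(String[] args) {
        long heapMb = 2611; // tm_heap_memory computed in this article
        long managed = managedMemoryMb(0, heapMb, 0.7);
        System.out.println(managed);          // 1827
        System.out.println(heapMb - managed); // 784 MB left as "free" memory for user code
    }
}
```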

The End

Good night.
