
Building Spark 3.x from Source to Support CDH 5.16.2 (hadoop-2.6.0-cdh5.16.2)


References

Kyuubi in practice | Compiling Spark 3.1 for CDH5 and integrating Kyuubi: https://jishuin.proginn.com/p/763bfbd67cf6

SPARK-35758: https://issues.apache.org/jira/browse/SPARK-35758

[SPARK-35758][DOCS] Update the document about building Spark with Hadoop for Hadoop 2.x and 3.x (apache/spark PR #32917): https://github.com/apache/spark/pull/32917

Spark 3 no longer supports Hadoop versions as old as 2.6 out of the box, while our production environment still runs CDH 5.16.2 (hadoop-2.6.0-cdh5.16.2), whose Hadoop core is quite old, so we have to build Spark 3 ourselves.

The method in this article has been used to successfully build Spark 3.0.3, 3.1.1, 3.1.2, 3.1.3, and 3.2.1. Since we decided to run the second-newest release in production, this article uses Spark 3.1.3 as the example.

1) Prepare the build environment

Prepare the Java, Scala, and Maven environments in advance:

java -version   # 1.8.0_311
mvn -v          # Apache Maven 3.6.3
scala -version  # 2.12.10

Add an environment variable (in /etc/profile) so that Maven can use more memory while compiling:

export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"

2) Download the source code

# Create the directory
sudo mkdir /bi_bigdata/user_shell/spark
# Download the source tarball into it
wget https://archive.apache.org/dist/spark/spark-3.1.3/spark-3.1.3.tgz -P /bi_bigdata/user_shell/spark
# Extract it into the target directory
tar -zxvf /bi_bigdata/user_shell/spark/spark-3.1.3.tgz -C /bi_bigdata/user_shell/spark/
cd /bi_bigdata/user_shell/spark/spark-3.1.3

3) Patch the incompatible code

These changes mainly target Hadoop versions below 2.6.4 and were worked out from the compile errors that came up.

① First change: the yarn module

vim resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
/* Comment out the original code:
    sparkConf.get(ROLLED_LOG_INCLUDE_PATTERN).foreach { includePattern =>
      try {
        val logAggregationContext = Records.newRecord(classOf[LogAggregationContext])
        logAggregationContext.setRolledLogsIncludePattern(includePattern)
        sparkConf.get(ROLLED_LOG_EXCLUDE_PATTERN).foreach { excludePattern =>
          logAggregationContext.setRolledLogsExcludePattern(excludePattern)
        }
        appContext.setLogAggregationContext(logAggregationContext)
      } catch {
        case NonFatal(e) =>
          logWarning(s"Ignoring ${ROLLED_LOG_INCLUDE_PATTERN.key} because the version of YARN " +
            "does not support it", e)
      }
    }
    appContext.setUnmanagedAM(isClientUnmanagedAMEnabled)
    sparkConf.get(APPLICATION_PRIORITY).foreach { appPriority =>
      appContext.setPriority(Priority.newInstance(appPriority))
    }
    appContext
  }
*/
/* Replace with: */
    sparkConf.get(ROLLED_LOG_INCLUDE_PATTERN).foreach { includePattern =>
      try {
        val logAggregationContext = Records.newRecord(classOf[LogAggregationContext])
        // These two methods were added in Hadoop 2.6.4, so we still need to use reflection to
        // avoid compile error when building against Hadoop 2.6.0 ~ 2.6.3.
        val setRolledLogsIncludePatternMethod =
          logAggregationContext.getClass.getMethod("setRolledLogsIncludePattern", classOf[String])
        setRolledLogsIncludePatternMethod.invoke(logAggregationContext, includePattern)
        sparkConf.get(ROLLED_LOG_EXCLUDE_PATTERN).foreach { excludePattern =>
          val setRolledLogsExcludePatternMethod =
            logAggregationContext.getClass.getMethod("setRolledLogsExcludePattern", classOf[String])
          setRolledLogsExcludePatternMethod.invoke(logAggregationContext, excludePattern)
        }
        appContext.setLogAggregationContext(logAggregationContext)
      } catch {
        case NonFatal(e) =>
          logWarning(s"Ignoring ${ROLLED_LOG_INCLUDE_PATTERN.key} because the version of YARN " +
            "does not support it", e)
      }
    }
    appContext
  }
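
The reflection trick above is worth understanding on its own: the setter is looked up by name at runtime, so the class compiles even when the Hadoop version on the compile classpath does not declare the method. Below is a minimal, self-contained sketch of the same pattern (illustrative only, not Spark code; the object and method names are made up for the example):

import scala.util.control.NonFatal

object ReflectiveSetterSketch {
  // Invoke a single-String-argument method by name if the target class provides it,
  // and skip quietly when it does not (e.g. when running against an older Hadoop).
  def setIfAvailable(target: AnyRef, methodName: String, value: String): Unit = {
    try {
      val m = target.getClass.getMethod(methodName, classOf[String])
      m.invoke(target, value)
    } catch {
      case _: NoSuchMethodException =>
        println(s"$methodName is not available in this version, skipping")
      case NonFatal(e) =>
        println(s"Failed to invoke $methodName: $e")
    }
  }

  def main(args: Array[String]): Unit = {
    val sb = new java.lang.StringBuilder
    setIfAvailable(sb, "append", "hello")         // java.lang.StringBuilder.append(String) exists
    setIfAvailable(sb, "noSuchSetter", "ignored") // gracefully skipped
    println(sb.toString)                          // hello
  }
}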

② Second change: the Utils module

vim core/src/main/scala/org/apache/spark/util/Utils.scala
// Comment out:
// import org.apache.hadoop.util.{RunJar, StringUtils}
// Replace with:
import org.apache.hadoop.util.RunJar

def unpack(source: File, dest: File): Unit = {
  // StringUtils cannot be resolved in hadoop 2.6.0, so drop the import and
  // replace the call with equivalent logic.
  // val lowerSrc = StringUtils.toLowerCase(source.getName)
  if (source.getName == null) {
    throw new NullPointerException
  }
  val lowerSrc = source.getName.toLowerCase()
  if (lowerSrc.endsWith(".jar")) {
    RunJar.unJar(source, dest, RunJar.MATCH_ANY)
  } else if (lowerSrc.endsWith(".zip")) {
    FileUtil.unZip(source, dest)
  } else if (
    lowerSrc.endsWith(".tar.gz") || lowerSrc.endsWith(".tgz") || lowerSrc.endsWith(".tar")) {
    FileUtil.unTar(source, dest)
  } else {
    logWarning(s"Cannot unpack $source, just copying it to $dest.")
    copyRecursive(source, dest)
  }
}
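
One small caveat about the replacement above: source.getName.toLowerCase() uses the JVM's default locale, whereas the Hadoop StringUtils helper it replaces is, as far as I know, there to do locale-independent lowercasing. If you want to keep that behaviour, pinning the locale is a closer drop-in. A minimal sketch of the idea (not the actual Spark code):

import java.io.File
import java.util.Locale

object LocaleSafeLowercase {
  // Lowercase a file name independently of the JVM default locale
  // (e.g. under the Turkish locale, "FILE.ZIP".toLowerCase() does not end with ".zip").
  def lowerName(source: File): String = {
    val name = source.getName
    if (name == null) {
      throw new NullPointerException
    }
    name.toLowerCase(Locale.ROOT)
  }

  def main(args: Array[String]): Unit = {
    println(lowerName(new File("/tmp/Example.ZIP"))) // prints "example.zip"
  }
}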

③ Third change: the HttpSecurityFilter module

vim core/src/main/scala/org/apache/spark/ui/HttpSecurityFilter.scala
private val parameterMap: Map[String, Array[String]] = {
  super.getParameterMap().asScala.map { case (name, values) =>
    // Unapplied methods are only converted to functions when a function type is expected.
    // You can make this conversion explicit by writing `stripXSS _` or `stripXSS(_)` instead of `stripXSS`.
    // stripXSS(name) -> values.map(stripXSS)
    stripXSS(name) -> values.map(stripXSS(_))
  }.toMap
}
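
The commented-out compiler message is the standard Scala 2.12 eta-expansion hint: a method reference is only converted into a function value when a function type is expected, and writing `stripXSS _` or `stripXSS(_)` makes the conversion explicit. A tiny self-contained illustration (not Spark code; `stripXss` here is a made-up stand-in):

object EtaExpansionSketch {
  // A stand-in for stripXSS: remove a few characters commonly used in markup.
  def stripXss(s: String): String = s.replaceAll("[<>\"']", "")

  def main(args: Array[String]): Unit = {
    val values = Array("<b>ok</b>", "plain")
    // val f = stripXss          // error: missing argument list for method stripXss;
    //                           // "Unapplied methods are only converted to functions
    //                           // when a function type is expected."
    val f = stripXss _           // explicit eta-expansion to a function value
    println(values.map(f).toList)            // List(bok/b, plain)
    println(values.map(stripXss(_)).toList)  // the form used in the patch above
  }
}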

4) Edit Spark's parent pom.xml: replace the original central repository URL with https://mvnrepository.com/repos/central and add the CDH repositories

vim pom.xml
<repository>
  <!--
    This is used as a fallback when the first try fails.
  -->
  <id>central</id>
  <name>Maven Repository</name>
  <url>https://mvnrepository.com/repos/central</url>
  <!--<url>https://repo.maven.apache.org/maven2</url>-->
  <releases>
    <enabled>true</enabled>
  </releases>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>
<!-- Add the CDH repository -->
<repository>
  <id>cloudera</id>
  <name>cloudera Repository</name>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
<!-- Add the CDH plugin repository -->
<pluginRepository>
  <id>cloudera</id>
  <name>Cloudera Repositories</name>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</pluginRepository>

5) Build the distribution package

Build the distributable binary tarball inside the extracted Spark source directory.

Following the official guidance, specify -Phadoop-2.7 when building against a Hadoop 2.x version:

[SPARK-35758][DOCS] Update the document about building Spark with Hadoop for Hadoop 2.x and 3.x (apache/spark PR #32917): https://github.com/apache/spark/pull/32917

./dev/make-distribution.sh --name 2.6.0-cdh5.16.2 --pip --tgz -Phive -Phive-thriftserver  -Pmesos -Pyarn -Pkubernetes -Phadoop-2.7 -Dhadoop.version=2.6.0-cdh5.16.2 -Dscala.version=2.12.10 -X

Once the build finishes, the distributable .tgz package appears in the current directory and can then be deployed to production.

# Check the generated tgz package
ll -h | grep tgz | grep spark
# spark-3.1.3-bin-2.6.0-cdh5.16.2.tgz
