Learning big data processing technologies requires a working software environment to practice against, but deploying these components is fairly tedious and not very beginner-friendly. Since my own work touches on this area, I spent a few days experimenting, drawing on material found online, and arrived at a simple and quick way to set up the environment locally. I am writing this article to record the process, in the hope that it gives beginners some reference and help.
This article describes how to build Spark on YARN plus Hive (using Derby in server mode as the metastore) locally with Docker. Why not use MySQL as the Hive metastore? Because this environment is meant for learning and testing, it should stay as simple as possible: Derby needs no separate configuration, can be started and used directly, and is light and convenient enough.
The complete code has been pushed to Gitee: spark-on-yarn-hive-derby
Component | Version |
---|---|
Spark image | bitnami/spark:3.1.2 |
Hadoop | 3.2.0 |
Hive | 3.1.2 |
Derby | 10.14.2.0 |
config/workers lists the hostnames of the worker nodes; these must match the hostnames defined in the docker-compose-*.yml files.
config/ssh_config is used for passwordless SSH login.
The config files that reference hostnames are core-site.xml, hive-site.xml, spark-hive-site.xml and yarn-site.xml; they must all stay consistent with the hostnames defined in the docker-compose-*.yml files. A quick way to check this is shown below.
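This is only a sketch: it assumes the hostnames local-spark-master / local-spark-worker1 used later in this article, and that the compose files sit under on-yarn/ and on-yarn-hive/.
# Hostnames referenced by the config files
grep -n "local-spark" config/core-site.xml config/hive-site.xml config/spark-hive-site.xml config/yarn-site.xml config/workers
# Hostnames declared in the compose files
grep -n "hostname" on-yarn/docker-compose-*.yml on-yarn-hive/docker-compose-*.yml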
The mature bitnami/spark:3.1.2 image is used as the starting point; openssh is installed on top of it to build a base image with passwordless login. Since both the master and the worker nodes are built from this base image, they share the same SSH keys, which simplifies installation and deployment.
docker build -t my/spark-base:3.1.2 -f base/Dockerfile .
docker build -t my/spark-hadoop:3.1.2 -f on-yarn/Dockerfile .
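If both builds succeed, the two images should now show up locally:
docker images | grep "my/spark"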
# Create the cluster
docker-compose -f on-yarn/docker-compose-manul.yml -p spark up -d
# Start hadoop
docker exec -it spark-master-1 sh /opt/start-hadoop.sh
# Stop the cluster
docker-compose -f on-yarn/docker-compose-manul.yml -p spark stop
# Remove the cluster
docker-compose -f on-yarn/docker-compose-manul.yml -p spark down
# Start the cluster again
docker-compose -f on-yarn/docker-compose-manul.yml -p spark start
# Start hadoop
docker exec -it spark-master-1 sh /opt/start-hadoop.sh
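After start-hadoop.sh finishes, a quick sanity check is to ask HDFS and YARN for their node reports from inside the master container (a sketch; it assumes the hadoop binaries are on the PATH in the container, which is how the image is built here):
docker exec -it spark-master-1 hdfs dfsadmin -report
docker exec -it spark-master-1 yarn node -list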
# Create the cluster
docker-compose -f on-yarn/docker-compose-auto.yml -p spark up -d
# Stop the cluster
docker-compose -f on-yarn/docker-compose-auto.yml -p spark stop
# Start the cluster
docker-compose -f on-yarn/docker-compose-auto.yml -p spark start
# Remove the cluster
docker-compose -f on-yarn/docker-compose-auto.yml -p spark down
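With the auto variant there is no separate start-hadoop.sh step to run by hand; following the compose logs is an easy way to watch the cluster come up:
docker-compose -f on-yarn/docker-compose-auto.yml -p spark logs -f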
docker build -t my/spark-hadoop-hive:3.1.2 -f on-yarn-hive/Dockerfile .
# Create the cluster
docker-compose -f on-yarn-hive/docker-compose-manul.yml -p spark up -d
# Start hadoop
docker exec -it spark-master-1 sh /opt/start-hadoop.sh
# Start hive
docker exec -it spark-master-1 sh /opt/start-hive.sh
# Stop the cluster
docker-compose -f on-yarn-hive/docker-compose-manul.yml -p spark stop
# Remove the cluster
docker-compose -f on-yarn-hive/docker-compose-manul.yml -p spark down
# Start the cluster again
docker-compose -f on-yarn-hive/docker-compose-manul.yml -p spark start
# Start hadoop
docker exec -it spark-master-1 sh /opt/start-hadoop.sh
# Start hive
docker exec -it spark-master-1 sh /opt/start-hive.sh
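Once start-hive.sh has run, a quick way to confirm that the Derby-backed metastore answers is to run a query through the hive CLI inside the container:
docker exec -it spark-master-1 hive -e "show databases"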
# Create the cluster
docker-compose -f on-yarn-hive/docker-compose-auto.yml -p spark up -d
# Stop the cluster
docker-compose -f on-yarn-hive/docker-compose-auto.yml -p spark stop
# Start the cluster
docker-compose -f on-yarn-hive/docker-compose-auto.yml -p spark start
# Remove the cluster
docker-compose -f on-yarn-hive/docker-compose-auto.yml -p spark down
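Container status for the auto variant can be checked at any time with:
docker-compose -f on-yarn-hive/docker-compose-auto.yml -p spark ps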
spark-shell --master yarn << EOF
// script body
// example:
val data = Array(1,2,3,4,5)
val distData = sc.parallelize(data)
val sum = distData.reduce((a,b)=>a+b)
println("Sum: "+sum)
EOF
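The same cluster also accepts batch submissions; here is a minimal sketch using the SparkPi example bundled with Spark (the jar path assumes the standard bitnami layout under /opt/bitnami/spark):
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /opt/bitnami/spark/examples/jars/spark-examples_2.12-3.1.2.jar 10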
hive -e "create table demo(name string)"
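To confirm the table really landed in the Derby-backed metastore, insert and read back a row:
hive -e "insert into demo values('test')"
hive -e "select * from demo"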
For example, I am on Windows, so I use the SwitchHosts tool to map the hostnames above to the right IP address; 192.168.138.1 here is the IP of the virtual network adapter:
192.168.138.1 local-spark-worker1
192.168.138.1 local-spark-master
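A quick way to verify the mappings took effect (Windows ping syntax, matching the system above):
ping -n 1 local-spark-master
ping -n 1 local-spark-worker1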
Upload the Spark dependency jars to HDFS:
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put -f /opt/bitnami/spark/jars/* /spark/jars
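These are the jars that spark.yarn.jars will point at in the Java example further down; the upload can be verified with:
hdfs dfs -ls /spark/jars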
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.1.2</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-yarn_2.12</artifactId>
    <version>3.1.2</version>
</dependency>
<dependency>
    <groupId>org.junit.jupiter</groupId>
    <artifactId>junit-jupiter</artifactId>
    <version>5.9.1</version>
    <scope>test</scope>
</dependency>
Submitting spark-sql in cluster mode: open the HDFS web UI at http://localhost:9870 in a browser, create the directory /user/my, enter it and upload spark-sql-cluster.jar.
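Alternatively, the directory and upload can be done from the command line inside the master container (assuming the jar has been copied into the container first, e.g. with docker cp):
hdfs dfs -mkdir -p /user/my
hdfs dfs -put -f spark-sql-cluster.jar /user/my/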
package org.demo.spark;

import org.apache.spark.SparkConf;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;
import org.junit.jupiter.api.Test;

public class SparkOnYarnTest {

    @Test
    public void yarnApiSubmit() {
        // prepare arguments to be passed to
        // org.apache.spark.deploy.yarn.Client object
        String[] args = new String[] {
                "--jar", "hdfs:///user/my/spark-sql-cluster.jar",
                "--class", "org.apache.spark.sql.hive.cluster.SparkSqlCliClusterDriver",
                "--arg", "spark-internal",
                "--arg", "-e",
                "--arg", "\\\"insert into demo(name) values('zhangsan')\\\""
        };

        // identify that you will be using Spark as YARN mode
        // System.setProperty("SPARK_YARN_MODE", "true");

        // create an instance of SparkConf object
        String appName = "Yarn Client Remote App";
        SparkConf sparkConf = new SparkConf();
        sparkConf.setMaster("yarn");
        sparkConf.setAppName(appName);
        sparkConf.set("spark.submit.deployMode", "cluster");
        sparkConf.set("spark.yarn.jars", "hdfs:///spark/jars/*.jar");
        sparkConf.set("spark.hadoop.yarn.resourcemanager.hostname", "local-spark-master");
        sparkConf.set("spark.hadoop.yarn.resourcemanager.address", "local-spark-master:8032");
        sparkConf.set("spark.hadoop.yarn.resourcemanager.scheduler.address", "local-spark-master:8030");

        // create ClientArguments, which will be passed to Client
        ClientArguments cArgs = new ClientArguments(args);

        // create an instance of yarn Client client
        Client client = new Client(cArgs, sparkConf, null);

        // submit Spark job to YARN
        client.run();
    }
}
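Once the test finishes, the inserted row can be checked from the Hive CLI on the master with hive -e "select * from demo"; the submitted application also shows up in the YARN ResourceManager UI.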