Official Spark download address: Spark download page
Note: choose the Spark build that matches your Hadoop version. The Hadoop version here is 3.3.3, and the corresponding package is spark-3.2.1-bin-hadoop3.2.tgz.
*Yarn mode requires a Hadoop cluster to be installed beforehand; for the installation steps see:
Apache Hadoop 3.3.3 Cluster Installation
Local mode is an environment in which Spark code runs on the local machine without resources from any other node; it is typically used for teaching, debugging, and demos.
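To make the idea concrete, the sketch below (a hypothetical example, not part of the original post) shows what Local mode means from the API side: the master URL local[*] runs the driver and executors inside a single JVM on the local machine.

import org.apache.spark.sql.SparkSession

object LocalModeDemo {
  def main(args: Array[String]): Unit = {
    // "local[*]" tells Spark to run everything in one JVM on this machine,
    // using as many worker threads as there are CPU cores -- no cluster needed.
    val spark = SparkSession.builder()
      .appName("LocalModeDemo")
      .master("local[*]")
      .getOrCreate()

    // A trivial job just to verify that the local environment works.
    val count = spark.sparkContext.parallelize(1 to 100).count()
    println(s"count = $count")

    spark.stop()
  }
}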
Upload the Spark package to the Linux host, extract it, and place it in the target directory.
tar -zxvf spark-3.2.1-bin-hadoop3.2.tgz -C /opt/soft
cd /opt/soft
mv spark-3.2.1-bin-hadoop3.2/ spark-3.2.1-local
(1) Enter the extracted directory and run the following command:
[root@node01 spark-3.2.1-local]# bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/06/06 12:53:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://node01:4040
Spark context available as 'sc' (master = local[*], app id = local-1654491218662).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.1
      /_/

Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_261)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
http://node01:4040
![Spark Web UI](https://img-blog.csdnimg.cn/img_convert/8b20f7debdd861c4acfbc00c844ecf3d.png)
In the data directory under the extracted folder, create a word.txt file, then run the following commands and the word count in the spark-shell.
[root@node01 data]# touch word.txt
[root@node01 data]# echo "hello spark" > word.txt
scala> sc.textFile("data/word.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
res0: Array[(String, Int)] = Array((hello,1), (spark,1))
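The same word count can also be packaged as a standalone application and run with spark-submit instead of the shell. The sketch below is illustrative only (the object name WordCount and the hard-coded data/word.txt path are assumptions, not part of the original); the master URL is left to be supplied on the spark-submit command line.

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Master is not set here; pass it at submit time, e.g. --master local[*].
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    // Same logic as the spark-shell one-liner: split each line into words,
    // map every word to (word, 1), then sum the counts per word.
    val counts = sc.textFile("data/word.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach { case (word, n) => println(s"$word: $n") }
    spark.stop()
  }
}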
[root@node01 spark-3.2.1-local]# bin/spark-submit \
> --class org.apache.spark.examples.SparkPi \
> --master local[2] \
> ./examples/jars/spark-examples_2.12-3.2.1.jar \
> 10
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/06/06 13:04:23 INFO SparkContext: Running Spark version 3.2.1
...
22/06/06 13:04:23 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/06/06 13:04:23 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://node01:4040
22/06/06 13:04:23 INFO SparkContext: Added JAR file:/opt/soft/spark-3.2.1-local/examples/jars/spark-examples_2.12-3.2.1.jar at spark://node01:39994/jars/spark-examples_2.12-3.2.1.jar with timestamp 1654491863010
22/06/06 13:04:24 INFO Executor: Starting executor ID driver on host node01
....
22/06/06 13:04:25 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.637438 s
Pi is roughly 3.142031142031142
....
Parameter notes:
(1) --class specifies the main class of the application to run;
(2) --master local[2] sets the deployment mode; the default is local mode, and the number is the number of virtual CPU cores allocated;
(3) spark-examples_2.12-3.2.1.jar is the jar that contains the application class to run;
(4) the number 10 is the command-line argument passed to the program, used here to set the number of tasks for the application (see the sketch below).
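To see how that trailing argument is consumed, here is a simplified Pi estimator in the spirit of the bundled SparkPi example. It is not the exact source shipped with Spark (the real code lives under examples/src/main/scala in the distribution); it only sketches how the argument ends up controlling the number of partitions.

import org.apache.spark.sql.SparkSession
import scala.util.Random

object PiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PiSketch").getOrCreate()

    // The trailing "10" on the spark-submit command line arrives here as args(0)
    // and decides how many partitions (and therefore tasks) the job is split into.
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices

    // Monte Carlo estimate: count random points that fall inside the unit circle.
    val hits = spark.sparkContext.parallelize(1 to n, slices).map { _ =>
      val x = Random.nextDouble() * 2 - 1
      val y = Random.nextDouble() * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)

    println(s"Pi is roughly ${4.0 * hits / n}")
    spark.stop()
  }
}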
In Standalone mode Spark supplies its own compute resources, with no need for any other framework, which reduces coupling with third-party resource frameworks and gives strong independence. However, Spark is first and foremost a compute engine rather than a resource scheduler, so resource scheduling is not its strong point; pairing Spark with a dedicated resource scheduler is more reliable. Since YARN is used very widely in production in China, the Spark runtime environment here is built on YARN.
Upload the spark-3.2.1-bin-hadoop3.2.tgz file to the Linux host, extract it, and place it in the target directory.
tar -zxvf spark-3.2.1-bin-hadoop3.2.tgz -C /opt/soft
cd /opt/soft
mv spark-3.2.1-bin-hadoop3.2/ spark-3.2.1-yarn/
(1) Modify the Hadoop configuration file and distribute it to all nodes
vim /opt/soft/hadoop-3.3.3/etc/hadoop/yarn-site.xml
<!-- Whether to start a thread that checks the amount of physical memory each task is using and kills the task if it exceeds its allocation; default is true -->
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- Whether to start a thread that checks the amount of virtual memory each task is using and kills the task if it exceeds its allocation; default is true -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
ssh_do_scp.sh ~/bin/node.list /opt/soft/hadoop-3.3.3/etc/hadoop/yarn-site.xml /opt/soft/hadoop-3.3.3/etc/hadoop/
(2) Modify conf/spark-env.sh and add the JAVA_HOME and YARN_CONF_DIR settings
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
export JAVA_HOME=/opt/soft/jdk1.8.0_261
YARN_CONF_DIR=/opt/soft/hadoop-3.3.3/etc/hadoop
Start the HDFS and YARN clusters, then submit the application to YARN.
[root@node01 spark-3.2.1-yarn]# bin/spark-submit \
> --class org.apache.spark.examples.SparkPi \
> --master yarn \
> --deploy-mode cluster \
> ./examples/jars/spark-examples_2.12-3.2.1.jar \
> 10
2022-06-06 13:38:37,992 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2022-06-06 13:38:38,080 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at node02/192.168.31.102:8032
...
2022-06-06 13:38:48,322 INFO yarn.Client: Submitting application application_1654493828584_0001 to ResourceManager
2022-06-06 13:38:48,711 INFO impl.YarnClientImpl: Submitted application application_1654493828584_0001
2022-06-06 13:38:49,718 INFO yarn.Client: Application report for application_1654493828584_0001 (state: ACCEPTED)
2022-06-06 13:38:49,722 INFO yarn.Client:
	 client token: N/A
	 diagnostics: AM container is launched, waiting for AM container to Register with RM
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1654493928432
	 final status: UNDEFINED
	 tracking URL: http://node02:8088/proxy/application_1654493828584_0001/
	 user: root
2022-06-06 13:38:50,726 INFO yarn.Client: Application report for application_1654493828584_0001 (state: ACCEPTED)
2022-06-06 13:38:51,729 INFO yarn.Client: Application report for application_1654493828584_0001 (state: RUNNING)
2022-06-06 13:38:55,745 INFO yarn.Client:
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: node01
	 ApplicationMaster RPC port: 37882
	 queue: default
	 start time: 1654493928432
	 final status: UNDEFINED
	 tracking URL: http://node02:8088/proxy/application_1654493828584_0001/
	 user: root
...
Application report for application_1654493828584_0001 (state: FINISHED)
2022-06-06 13:39:07,807 INFO yarn.Client:
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: node01
	 ApplicationMaster RPC port: 37882
	 queue: default
	 start time: 1654493928432
	 final status: SUCCEEDED
	 tracking URL: http://node02:8088/proxy/application_1654493828584_0001/
	 user: root
2022-06-06 13:39:07,818 INFO util.ShutdownHookManager: Shutdown hook called
2022-06-06 13:39:07,819 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-3515b475-3558-42e3-bbfc-89c56a10bc6f
2022-06-06 13:39:07,821 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-1f7e03ac-815f-4a16-904e-c6fb64b8d683
Open http://node02:8088, and click History to view the history page.
cd /opt/soft/spark-3.2.1-yarn/conf
(1) Rename spark-defaults.conf.template to spark-defaults.conf
mv spark-defaults.conf.template spark-defaults.conf
(2) Configure the event log storage path
vim spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir hdfs://node01:8020/sparkHistory
spark.yarn.historyServer.address=node01:18080
spark.history.ui.port=18080
Note: the Hadoop cluster must be started, and the directory must already exist on HDFS.
[root@node01 conf]# hadoop fs -mkdir /sparkHistory
[root@node01 conf]# hadoop fs -ls /
Found 3 items
drwxr-xr-x - root supergroup 0 2022-06-06 13:47 /sparkHistory
drwx------ - root supergroup 0 2022-06-02 11:46 /tmp
drwxr-xr-x - root supergroup 0 2022-06-06 13:38 /user
(3) Modify the spark-env.sh file and add the history log configuration
vim spark-env.sh
export SPARK_HISTORY_OPTS="
-Dspark.history.ui.port=18080
-Dspark.history.fs.logDirectory=hdfs://node01:8020/sparkHistory
-Dspark.history.retainedApplications=30"
Parameter 1: the history server web UI is accessed on port 18080.
Parameter 2: the storage path of the history server logs.
Parameter 3: the number of Applications whose history is retained; when this limit is exceeded, information about the oldest applications is deleted. This is the number of applications held in memory, not the number shown on the page.
(4) Start the history server
[root@node01 spark-3.2.1-yarn]# sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/soft/spark-3.2.1-yarn/logs/spark-root-org.apache.spark.deploy.history.HistoryServer-1-node01.out
(5) Resubmit the application
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
./examples/jars/spark-examples_2.12-3.2.1.jar \
10
(6) View the logs in the web page: http://node02:8088
(1) Comparison of the deployment modes:
(2) Port numbers:
Port for viewing the jobs of the currently running spark-shell: 4040 (computation)
Spark history server port: 18080
Port for viewing Hadoop YARN job status: 8088