Zeppelin is an open-source Apache incubator project. It is a web-based notebook tool that supports interactive data analysis. Interpreters for various languages and backends are plugged in, so users can run interactive queries against a specific language or data-processing backend and quickly visualize the results.
Zeppelin: queries and analyzes the data and generates reports.
Spark: provides the backend data engine for Zeppelin. It is a computing model based on RDDs (Resilient Distributed Datasets) that processes very large datasets in a distributed way: the data is first split into partitions, each partition is computed separately, and the partial results are then merged.
CentOS installs OpenJDK by default, which does not meet our requirements, so we first uninstall it and then install the Oracle JDK.
1.1. Check the Java version: rpm -qa | grep java
1.2. Uninstall OpenJDK: yum -y remove java-xxx-openjdk-xxx
Oracle JDK download address: https://download.oracle.com/otn/java/jdk/8u311-b11/4d5417147a92418ea8b615e228bb6935/jdk-8u311-linux-x64.rpm
Install the JDK with rpm
rpm -ivh jdk-8u311-linux-x64.rpm
Check the installed version
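For example (the version string should now show the Oracle build 1.8.0_311):
- java -version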
Add to the environment variables
- vim /etc/profile
- export JAVA_HOME=/usr/java/jdk1.8.0_311-amd64
- export JAVA_BIN=/usr/java/jdk1.8.0_311-amd64/bin
- export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/jre/lib/rt.jar
- export PATH=$PATH:$JAVA_HOME/bin
- # Save and exit, then run the following command so the environment variables take effect immediately
- source /etc/profile
- 192.168.0.61
- hostnamectl set-hostname master
- 192.168.0.64
- hostnamectl set-hostname slave01
- 192.168.0.65
- hostnamectl set-hostname slave02
-
- # Configure hosts (IP-to-hostname mapping)
- vim /etc/hosts
- 192.168.0.61 master
- 192.168.0.64 slave01
- 192.168.0.65 slave02
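With the hosts file in place on each node, a quick hostname-resolution sanity check (assuming the nodes can reach each other) is:
- ping -c 1 slave01
- ping -c 1 slave02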
Sync the system time with an NTP server: ntpdate cn.pool.ntp.org
- vi /etc/selinux/config
- Change SELINUX=enforcing to SELINUX=disabled
- A reboot is required for this change to take effect
-
- systemctl stop firewalld
- systemctl disable firewalld
- tar -zxf scala3-3.1.0.tar.gz -C /usr/local/
- mv /usr/local/scala3-3.1.0 /usr/local/scala3
- vim /etc/profile
- export PATH=$PATH:$JAVA_HOME/bin:/usr/local/scala3/bin
- # Save and exit, then run the following command so the environment variables take effect immediately
- source /etc/profile
Check whether the installation succeeded
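For example, the Scala launcher can report its version (exact wording varies between Scala 3 releases):
- scala -version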
- useradd hadoop
- passwd hadoop
Switch to the hadoop user
su hadoop
- ssh-keygen -t rsa
- ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub master
- ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub slave01
- ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub slave02
The steps above must be performed on every virtual machine.
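To confirm passwordless SSH works, a quick check as the hadoop user is (each command should print the remote hostname without asking for a password):
- ssh hadoop@slave01 hostname
- ssh hadoop@slave02 hostname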
Download Hadoop
https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Extract the archive
- tar -zxf hadoop-3.3.1.tar.gz -C /data/java-service/
- cd /data/java-service/
- mv hadoop-3.3.1/ hadoop
- cd hadoop
- # Create the dfs-related directories
- mkdir -p dfs/name
- mkdir -p dfs/data
- mkdir -p dfs/namesecondary
Configure the environment variables
- vim /etc/profile
- # hadoop
- export HADOOP_HOME=/data/java-service/hadoop
- export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
- # Save and exit, then run the following command so the environment variables take effect immediately
- source /etc/profile
Go into the Hadoop configuration directory and start setting the parameters.
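In Hadoop 3.x the configuration files live under etc/hadoop inside the installation directory, so with the layout used here:
- cd /data/java-service/hadoop/etc/hadoop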
cp core-site.xml core-site.xml.cp
For the full list of parameters see: https://hadoop.apache.org/docs/r3.3.1/hadoop-project-dist/hadoop-common/core-default.xml
Add the following inside <configuration></configuration>:
vim core-site.xml
- <property>
- <name>fs.defaultFS</name>
- <value>hdfs://master:9000</value>
- <description>NameNode URI.</description>
- </property>
- <property>
- <name>io.file.buffer.size</name>
- <value>131072</value>
- <description>Size of read/write buffer used in SequenceFiles.</description>
- </property>
cp hdfs-site.xml hdfs-site.xml.cp
For the full list of parameters see: http://hadoop.apache.org/docs/r3.3.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
Add the following inside <configuration></configuration>:
vim hdfs-site.xml
- <property>
- <name>dfs.namenode.secondary.http-address</name>
- <value>master:50070</value>
- <description>The secondary namenode http server address and port.</description>
- </property>
- <property>
- <name>dfs.namenode.name.dir</name>
- <value>file:///data/java-service/hadoop/dfs/name</value>
- <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
- </property>
- <property>
- <name>dfs.datanode.data.dir</name>
- <value>file:///data/java-service/hadoop/dfs/data</value>
- <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
- </property>
- <property>
- <name>dfs.namenode.checkpoint.dir</name>
- <value>file:///data/java-service/hadoop/dfs/namesecondary</value>
- <description>Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.</description>
- </property>
- <property>
- <name>dfs.replication</name>
- <value>2</value>
- </property>

cp mapred-site.xml mapred-site.xml.cp
For the full list of parameters see: https://hadoop.apache.org/docs/r3.3.1/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
Add the following inside <configuration></configuration>:
vim mapred-site.xml
- <property>
- <name>mapreduce.framework.name</name>
- <value>yarn</value>
- <description>The runtime framework for executing MapReduce jobs. Can be one of local, classic or yarn.</description>
- </property>
- <property>
- <name>mapreduce.jobhistory.address</name>
- <value>master:10020</value>
- <description>MapReduce JobHistory Server IPC host:port</description>
- </property>
-
- <property>
- <name>mapreduce.jobhistory.webapp.address</name>
- <value>master:19888</value>
- <description>MapReduce JobHistory Server Web UI host:port</description>
- </property>

cp yarn-site.xml yarn-site.xml.cp
Default configuration reference: http://hadoop.apache.org/docs/r3.3.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
Add the following inside <configuration></configuration>:
vim yarn-site.xml
- <property>
- <name>yarn.resourcemanager.hostname</name>
- <value>master</value>
- <description>The hostname of the RM.</description>
- </property>
- <property>
- <name>yarn.nodemanager.aux-services</name>
- <value>mapreduce_shuffle</value>
- <description>Shuffle service that needs to be set for Map Reduce applications.</description>
- </property>
In etc/hadoop/hadoop-env.sh, set the JDK path:
export JAVA_HOME=/usr/java/jdk1.8.0_311-amd64
In etc/hadoop/workers, list the slave nodes:
- slave01
- slave02
chown hadoop:hadoop -R /data/java-service/hadoop/logs/
- scp -r /data/java-service/hadoop root@slave01:/data/java-service/
- scp -r /data/java-service/hadoop root@slave02:/data/java-service/
The two slave nodes also need the environment variables configured in /etc/profile.
Troubleshooting
1. ERROR: Cannot set priority of namenode process / Does not contain a valid host:port authority
Fix: investigation showed the cause was an invalid hostname; change the hostname so it contains none of the illegal characters such as '.', '/' or '_'.
2. Directory /data/java-service/hadoop/dfs/name is in an inconsistent state
Permission problem; fix with chown -R hadoop:hadoop /data/java-service/hadoop/dfs
3. <hostname>:9000 failed on connection exception: java.net.ConnectException: Connection refused
Check ./logs/hadoop-hadoop-secondarynamenode-master.log for the details
4. java.io.IOException: NameNode is not formatted
Fix:
Delete the files under <hadoop install dir>/dfs/ and then re-initialize:
<hadoop install dir>/bin/hdfs namenode -format
5. At runtime the master was found not to be running a DataNode
./sbin/hadoop-daemon.sh start datanode
References consulted while debugging these errors:
ubuntu 18.04配置Hadoop 3.1.1些许问题记录_ygd11的专栏-CSDN博客
hadoop错误:Does not contain a valid host:port authority - 大墨垂杨 - 博客园
Hadoop集群启动NameNode错误 JAVA.IO.IOEXCEPTION: NAMENODE IS NOT FORMATTED_fa124607857的博客-CSDN博客
Hadoop分别启动namenode,datanode,secondarynamenode等服务_大黑牛的博客-CSDN博客_启动namenode
Once the Hadoop cluster is up and running, check the web UIs of its components as described below:
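Assuming the default Hadoop 3.x ports (apart from the overrides made in the configuration above), the usual entry points are:
- NameNode: http://master:9870
- ResourceManager: http://master:8088
- MapReduce JobHistory Server: http://master:19888 (matches mapreduce.jobhistory.webapp.address above)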
1. Extract and rename the directory
Extract to the target directory: tar zxf spark-3.2.0-bin-hadoop3.2.tgz -C /data/java-service/
Rename the directory: mv /data/java-service/spark-3.2.0-bin-hadoop3.2 /data/java-service/spark
2. Edit the configuration files
2.1 Environment variables
vim /etc/profile
- # spark
- export SPARK_HOME=/data/java-service/spark
- export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin
2.2 spark-env.sh
cd /data/java-service/spark/conf
Copy the template file to create a new one: cp spark-env.sh.template spark-env.sh
vim spark-env.sh
- export JAVA_HOME=/usr/java/jdk1.8.0_311-amd64
- export SPARK_MASTER_IP=master
- export SPARK_MASTER_PORT=7077
- export SPARK_WORKER_CORES=1
- export SPARK_WORKER_INSTANCES=1
- export SPARK_WORKER_MEMORY=900M
2.3 workers
Copy the template file to create a new one: cp workers.template workers
vim workers
- localhost
- master
- slave01
- slave02
3. Copy the configured spark directory to the worker nodes
- scp -r /data/java-service/spark hadoop@slave01:/data/java-service/spark
- scp -r /data/java-service/spark hadoop@slave02:/data/java-service/spark
On each worker, edit /etc/profile and add the same Spark-related settings as on the master node.
4. Start
/data/java-service/spark/sbin/start-all.sh
Verify the Spark cluster after startup
Visit http://master:8080 in a browser to view the node information.
Monitoring:
Every driver program has a web UI, typically on port 4040, that shows information about running tasks, executors and storage usage. Simply open http://<driver-node>:4040 in a web browser.
Download address: http://zeppelin.apache.org/download.html
1. Extract and rename the directory
Extract to the target directory:
- tar -zxf zeppelin-0.10.0-bin-all.tgz -C /data/java-service/
- # Rename the directory:
- mv /data/java-service/zeppelin-0.10.0-bin-all/ /data/java-service/zeppelin
2. Edit the configuration files
2.1 conf/zeppelin-env.sh
- cd /data/java-service/zeppelin/conf
- cp zeppelin-env.sh.template zeppelin-env.sh
vim zeppelin-env.sh
- export JAVA_HOME=/usr/java/jdk1.8.0_311-amd64
- export ZEPPELIN_ADDR=192.168.0.61
- export ZEPPELIN_PORT=9090
Start and stop:
- ./bin/zeppelin-daemon.sh start
- ./bin/zeppelin-daemon.sh stop
Access: http://192.168.0.61:9090
Configure JDBC:
Configure the MySQL connection (the mysql-connector driver needs to be selected for the interpreter)
mysql-connector download address: https://mvnrepository.com/artifact/mysql/mysql-connector-java
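A minimal sketch of the jdbc interpreter properties in the Zeppelin UI (Interpreter -> jdbc), assuming MySQL Connector/J 8.x; the host, database and credentials are placeholders:
- default.driver   com.mysql.cj.jdbc.Driver
- default.url      jdbc:mysql://<mysql-host>:3306/<database>
- default.user     <user>
- default.password <password>
- # Dependency artifact: mysql:mysql-connector-java:8.0.27 (or the path of the downloaded jar)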
Create a note and run a test query against MySQL
%jdbc select id from noveltells_net.n_permission limit 10
After the initial setup, importing a large amount of data and then querying through Zeppelin produced the error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 4.0 failed 1 times, most recent failure
The problem:
A large number of small files affects the manageability of the Hadoop cluster and the stability of Spark when processing data:
1. When Spark SQL writes to Hive or directly to HDFS, too many small files put enormous pressure on NameNode memory management and affect the stable operation of the whole cluster.
2. It easily leads to an excessive number of tasks; if the collected result exceeds the spark.driver.maxResultSize setting (default 1g), an exception like the one above is thrown and the job fails.
The quick-and-dirty fix: raise spark.driver.maxResultSize above its default, for example as sketched below.
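One way to raise the limit, assuming the queries run through the Zeppelin spark interpreter (the same property can also be set in conf/spark-defaults.conf); the 2g value is only an example and should be sized to the available driver memory:
- # Zeppelin UI: Interpreter -> spark -> add/edit property
- spark.driver.maxResultSize  2g
- # or in /data/java-service/spark/conf/spark-defaults.conf
- spark.driver.maxResultSize  2g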
Reference: Spark SQL 小文件问题产生原因分析以及处理方案 - 知乎
Starting Hadoop:
- su hadoop
- cd /data/java-service/hadoop/sbin
- start-all.sh
- ./hadoop-daemon.sh start datanode # start the DataNode
-
- jps # check which processes are running
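With this configuration, jps on the master should show roughly the following daemons (PIDs will differ):
- NameNode
- SecondaryNameNode
- ResourceManager
- Jps
and each slave node should show DataNode and NodeManager.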
Starting Spark:
/data/java-service/spark/sbin/start-all.sh
Starting Zeppelin:
- ./bin/zeppelin-daemon.sh start
- ./bin/zeppelin-daemon.sh stop
Common HDFS commands:
- # Create a directory
- hadoop fs -mkdir <hdfs path> -- creates directories only one level at a time; the command fails if the parent directory does not exist
- hadoop fs -mkdir -p <hdfs path> -- also creates any missing parent directories
- # Test whether a directory or file exists
- hdfs dfs -test -e <hdfs path>
- # Delete a file
- hdfs dfs -rm <hdfs file>
- # Delete a directory
- hdfs dfs -rmr <hdfs path>
- # Upload data
- hdfs dfs -put <local file> <hdfs file> -- the parent directory of the hdfs file must already exist, otherwise the command fails
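A short end-to-end example of these commands, with arbitrary placeholder paths:
- hadoop fs -mkdir -p /user/hadoop/test
- hdfs dfs -put /etc/hosts /user/hadoop/test/hosts
- hdfs dfs -test -e /user/hadoop/test/hosts && echo exists
- hdfs dfs -rm /user/hadoop/test/hosts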
6. Problems encountered when querying MySQL from Zeppelin:
java.sql.SQLException: GC overhead limit exceeded
at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129)
This happens when the Spark job run by Zeppelin's built-in Spark interpreter exceeds the GC overhead limit. The concrete error can be found in logs/zeppelin-interpreter-spark.log. Increase the memory available to the driver by setting spark.driver.memory in the Zeppelin GUI, for example as sketched below.
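A sketch of the relevant properties in the spark interpreter settings (Interpreter -> spark); the sizes are only illustrative and must fit within the machine's available memory:
- spark.driver.memory    2g
- spark.executor.memory  2g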
To verify that the settings took effect:
ps -ef | grep zeppelin and check spark.executor.memory and spark.driver.memory in the process arguments