Server setup
Change the time zone to CST (Asia/Shanghai)
[sarah@hadoop104 ha]$ sudo vi /etc/sysconfig/clock
ZONE="Asia/Shanghai"
[sarah@hadoop104 ha]$ sudo rm -rf /etc/localtime
[sarah@hadoop104 ha]$ sudo yum install -y MySQL-python
[sarah@hadoop104 ha]$ sudo ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
[sarah@hadoop104 ha]$ date
Tue May 16 15:25:46 CST 202
Note: Extra Packages for Enterprise Linux (EPEL) provides additional packages for the Red Hat family of operating systems (RHEL, CentOS, and Scientific Linux). It is essentially an extra software repository; most of these rpm packages cannot be found in the official repositories.
[root@hadoop100 ~]# yum clean all
[root@hadoop100 ~]# yum install -y epel-release
[root@hadoop100 ~]# yum install -y vim
[root@hadoop100 ~]# yum install net-tools -y
[root@hadoop100 ~]# sudo yum -y install wget
[sarah@hadoop102 bin]$ sudo yum install -y rsync
[root@hadoop100 ~]# systemctl stop firewalld
[root@hadoop100 ~]# systemctl disable firewalld.service
Note: in enterprise environments, the firewall on individual servers is usually turned off; the company instead sets up a very secure firewall at the perimeter.
[root@hadoop100 ~]# useradd sarah
[root@hadoop100 ~]# passwd sarah
[root@hadoop100 ~]# vim /etc/sudoers
Edit the /etc/sudoers file and add a line below the %wheel line, as shown:
## Allow root to run any commands anywhere
root ALL=(ALL) ALL
## Allows people in group wheel to run all commands
%wheel ALL=(ALL) ALL
sarah ALL=(ALL) NOPASSWD:ALL
Note: do not put the sarah line directly below the root line. Because all users belong to the wheel group, if you configure passwordless sudo for sarah first, that setting is overridden again (back to requiring a password) when processing reaches the %wheel line. The sarah line therefore has to go below the %wheel line.
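To double-check the result (a minimal, optional verification), list sarah's sudo privileges:
[root@hadoop100 ~]# sudo -l -U sarah
The output should contain an entry ending in (ALL) NOPASSWD: ALL.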
Passwordless sudo is now configured for the root and sarah users.
(1) Create the module and software directories under /opt
[root@hadoop100 ~]# mkdir /opt/module
[root@hadoop100 ~]# mkdir /opt/software
(2) Change the owner and group of the module and software directories to the sarah user
[root@hadoop100 ~]# chown sarah:sarah /opt/module
[root@hadoop100 ~]# chown sarah:sarah /opt/software
(3) Check the owner and group of the module and software directories
[root@hadoop100 ~]# cd /opt/
[root@hadoop100 opt]# ll
total 12
drwxr-xr-x. 2 sarah sarah 4096 May 28 17:18 module
drwxr-xr-x. 2 root root 4096 Sep 7 2017 rh
drwxr-xr-x. 2 sarah sarah 4096 May 28 17:18 software
[root@hadoop100 ~]# rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps
8. Change the hostname; hadoop102 is used as the example below
(1) Change the hostname
[root@hadoop100 ~]# vim /etc/hostname
hadoop102
(2) Configure the hostname mappings in the hosts file on the Linux servers; open /etc/hosts
[root@hadoop100 ~]# vim /etc/hosts
Add the following:
192.168.10.100 hadoop100
192.168.10.101 hadoop101
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
[root@hadoop100 ~]# reboot
10. Modify the Windows hosts file
On Windows 10, copy the file out first, edit and save it, then copy it back to overwrite the original.
① Go to C:\Windows\System32\drivers\etc
② Copy the hosts file to the desktop
③ Open the hosts file on the desktop and add the following
192.168.10.100 hadoop100
192.168.10.101 hadoop101
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
192.168.10.105 hadoop105
192.168.10.106 hadoop106
192.168.10.107 hadoop107
192.168.10.108 hadoop108
④ Copy the desktop hosts file back over the one in C:\Windows\System32\drivers\etc
Write the cluster distribution script xsync
① Create a bin directory under the user's home directory /home/sarah
[sarah@hadoop102 ~]$ mkdir bin
② Create the xsync file in /home/sarah/bin so it can be called from anywhere
[sarah@hadoop102 opt]$ sudo yum install rsync -y
[sarah@hadoop102 ~]$ cd /home/sarah/bin
[sarah@hadoop102 bin]$ vim xsync
Write the following code in the file:
#!/bin/bash

# 1. Check the number of arguments
if [ $# -lt 1 ]
then
    echo Not Enough Arguments!
    exit;
fi

# 2. Iterate over all machines in the cluster
for host in hadoop102 hadoop103 hadoop104
do
    echo ==================== $host ====================
    # 3. Iterate over all files/directories and send them one by one
    for file in $@
    do
        # 4. Check whether the file exists
        if [ -e $file ]
        then
            # 5. Get the parent directory
            pdir=$(cd -P $(dirname $file); pwd)
            # 6. Get the file name
            fname=$(basename $file)
            ssh $host "mkdir -p $pdir"
            rsync -av $pdir/$fname $host:$pdir
        else
            echo $file does not exist!
        fi
    done
done
③ Make the xsync script executable
[sarah@hadoop102 bin]$ chmod +x xsync
④ Test the script
[sarah@hadoop102 bin]$ xsync xsync
(1) Generate the public and private keys on hadoop102
[sarah@hadoop102 .ssh]$ ssh-keygen -t rsa
Press Enter three times; two files are generated: id_rsa (private key) and id_rsa.pub (public key)
(2) Copy the hadoop102 public key to the target machines that should allow passwordless login
[sarah@hadoop102 .ssh]$ ssh-copy-id hadoop102
[sarah@hadoop102 .ssh]$ ssh-copy-id hadoop103
[sarah@hadoop102 .ssh]$ ssh-copy-id hadoop104
Do the same on hadoop103 and hadoop104.
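The same steps can be scripted; a minimal sketch to run once on each of hadoop103 and hadoop104 (enter the password when ssh-copy-id prompts for it):
[sarah@hadoop103 ~]$ ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
[sarah@hadoop103 ~]$ for host in hadoop102 hadoop103 hadoop104; do ssh-copy-id $host; done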
[sarah@hadoop102 software]$ tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/
(1) Create the /etc/profile.d/my_env.sh file
[sarah@hadoop102 module]$ sudo vim /etc/profile.d/my_env.sh
Add the following, then save and quit (:wq)
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin
(2) Make the environment variables take effect
[sarah@hadoop102 software]$ source /etc/profile.d/my_env.sh
[sarah@hadoop102 module]$ java -version
If you see the following output, Java is installed correctly
java version "1.8.0_212"
[sarah@hadoop102 module]$ xsync /opt/module/jdk1.8.0_212/
[sarah@hadoop102 module]$ sudo /home/sarah/bin/xsync /etc/profile.d/my_env.sh
[sarah@hadoop103 module]$ source /etc/profile.d/my_env.sh
[sarah@hadoop104 module]$ source /etc/profile.d/my_env.sh
(1) Create the script xcall.sh in the /home/sarah/bin directory
[sarah@hadoop102 bin]$ vim xcall.sh
(2) Write the following into the script
#! /bin/bash
for i in hadoop102 hadoop103 hadoop104
do
echo --------- $i ----------
ssh $i "$*"
done
(3) Make the script executable
[sarah@hadoop102 bin]$ chmod 777 xcall.sh
(4) Run the script
[sarah@hadoop102 bin]$ xcall.sh jps
Version: hadoop-3.1.3.tar.gz
1.1 Hadoop installation and deployment
1.1.1 Upload and extract the installation package
[sarah@hadoop102 software]$ ll
total 520600
-rw-rw-r--. 1 sarah sarah 338075860 Jan 20 10:55 hadoop-3.1.3.tar.gz
-rw-rw-r--. 1 sarah sarah 195013152 Jan 20 10:40 jdk-8u212-linux-x64.tar.gz
[sarah@hadoop102 software]$ tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/
[sarah@hadoop102 software]$ cd /opt/module/
[sarah@hadoop102 module]$ ll
total 0
drwxr-xr-x. 9 sarah sarah 149 Sep 12 2019 hadoop-3.1.3
drwxr-xr-x. 7 sarah sarah 245 Apr 2 2019 jdk1.8.0_212
[sarah@hadoop102 module]$ mv hadoop-3.1.3/ hadoop
[sarah@hadoop102 module]$ ll
total 0
drwxr-xr-x. 9 sarah sarah 149 Sep 12 2019 hadoop
drwxr-xr-x. 7 sarah sarah 245 Apr 2 2019 jdk1.8.0_212
[sarah@hadoop102 module]$ sudo vim /etc/profile.d/my_env.sh
#HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
4. Distribute the environment variable configuration file to the other nodes
[sarah@hadoop102 module]$ sudo /home/sarah/bin/xsync /etc/profile.d/my_env.sh
[sarah@hadoop102 module]$ xcall "source /etc/profile"
1.1.2 Modify the configuration files (under /opt/module/hadoop/etc/hadoop)
capacity-scheduler.xml
<!-- Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. See accompanying LICENSE file. --> <configuration> <property> <name>yarn.scheduler.capacity.maximum-applications</name> <value>10000</value> <description> Maximum number of applications that can be pending and running. </description> </property> <property> <name>yarn.scheduler.capacity.maximum-am-resource-percent</name> <value>0.8</value> <description> Maximum percent of resources in the cluster which can be used to run application masters i.e. controls number of concurrent running applications. </description> </property> <property> <name>yarn.scheduler.capacity.resource-calculator</name> <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value> <description> The ResourceCalculator implementation to be used to compare Resources in the scheduler. The default i.e. DefaultResourceCalculator only uses Memory while DominantResourceCalculator uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc. </description> </property> <property> <name>yarn.scheduler.capacity.root.queues</name> <value>default</value> <description> The queues at the this level (root is the root queue). </description> </property> <property> <name>yarn.scheduler.capacity.root.default.capacity</name> <value>100</value> <description>Default queue target capacity.</description> </property> <property> <name>yarn.scheduler.capacity.root.default.user-limit-factor</name> <value>1</value> <description> Default queue user limit a percentage from 0.0 to 1.0. </description> </property> <property> <name>yarn.scheduler.capacity.root.default.maximum-capacity</name> <value>100</value> <description> The maximum capacity of the default queue. </description> </property> <property> <name>yarn.scheduler.capacity.root.default.state</name> <value>RUNNING</value> <description> The state of the default queue. State can be one of RUNNING or STOPPED. </description> </property> <property> <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name> <value>*</value> <description> The ACL of who can submit jobs to the default queue. </description> </property> <property> <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name> <value>*</value> <description> The ACL of who can administer jobs on the default queue. </description> </property> <property> <name>yarn.scheduler.capacity.root.default.acl_application_max_priority</name> <value>*</value> <description> The ACL of who can submit applications with configured priority. For e.g, [user={name} group={name} max_priority={priority} default_priority={priority}] </description> </property> <property> <name>yarn.scheduler.capacity.root.default.maximum-application-lifetime </name> <value>-1</value> <description> Maximum lifetime of an application which is submitted to a queue in seconds. Any value less than or equal to zero will be considered as disabled. This will be a hard time limit for all applications in this queue. 
If positive value is configured then any application submitted to this queue will be killed after exceeds the configured lifetime. User can also specify lifetime per application basis in application submission context. But user lifetime will be overridden if it exceeds queue maximum lifetime. It is point-in-time configuration. Note : Configuring too low value will result in killing application sooner. This feature is applicable only for leaf queue. </description> </property> <property> <name>yarn.scheduler.capacity.root.default.default-application-lifetime </name> <value>-1</value> <description> Default lifetime of an application which is submitted to a queue in seconds. Any value less than or equal to zero will be considered as disabled. If the user has not submitted application with lifetime value then this value will be taken. It is point-in-time configuration. Note : Default lifetime can't exceed maximum lifetime. This feature is applicable only for leaf queue. </description> </property> <property> <name>yarn.scheduler.capacity.node-locality-delay</name> <value>40</value> <description> Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. When setting this parameter, the size of the cluster should be taken into account. We use 40 as the default value, which is approximately the number of nodes in one rack. Note, if this value is -1, the locality constraint in the container request will be ignored, which disables the delay scheduling. </description> </property> <property> <name>yarn.scheduler.capacity.rack-locality-additional-delay</name> <value>-1</value> <description> Number of additional missed scheduling opportunities over the node-locality-delay ones, after which the CapacityScheduler attempts to schedule off-switch containers, instead of rack-local ones. Example: with node-locality-delay=40 and rack-locality-delay=20, the scheduler will attempt rack-local assignments after 40 missed opportunities, and off-switch assignments after 40+20=60 missed opportunities. When setting this parameter, the size of the cluster should be taken into account. We use -1 as the default value, which disables this feature. In this case, the number of missed opportunities for assigning off-switch containers is calculated based on the number of containers and unique locations specified in the resource request, as well as the size of the cluster. </description> </property> <property> <name>yarn.scheduler.capacity.queue-mappings</name> <value></value> <description> A list of mappings that will be used to assign jobs to queues The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]* Typically this list will be used to map users to queues, for example, u:%user:%user maps all users to queues with the same name as the user. </description> </property> <property> <name>yarn.scheduler.capacity.queue-mappings-override.enable</name> <value>false</value> <description> If a queue mapping is present, will it override the value specified by the user? This can be used by administrators to place jobs in queues that are different than the one specified by the user. The default is false. </description> </property> <property> <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments</name> <value>1</value> <description> Controls the number of OFF_SWITCH assignments allowed during a node's heartbeat. Increasing this value can improve scheduling rate for OFF_SWITCH containers. 
Lower values reduce "clumping" of applications on particular nodes. The default is 1. Legal values are 1-MAX_INT. This config is refreshable. </description> </property> <property> <name>yarn.scheduler.capacity.application.fail-fast</name> <value>false</value> <description> Whether RM should fail during recovery if previous applications' queue is no longer valid. </description> </property> </configuration>
core-site.xml
<configuration>
    <!-- NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop102:8020</value>
    </property>
    <!-- Hadoop data storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop/data</value>
    </property>
    <!-- Static user for HDFS web UI login: sarah -->
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>sarah</value>
    </property>
    <!-- Hosts from which the sarah (superuser) proxy user may connect -->
    <property>
        <name>hadoop.proxyuser.sarah.hosts</name>
        <value>*</value>
    </property>
    <!-- Groups that the sarah (superuser) proxy user may impersonate -->
    <property>
        <name>hadoop.proxyuser.sarah.groups</name>
        <value>*</value>
    </property>
    <!-- Users that the sarah (superuser) proxy user may impersonate -->
    <property>
        <name>hadoop.proxyuser.sarah.users</name>
        <value>*</value>
    </property>
</configuration>
hdfs-site.xml
<configuration>
    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>hadoop102:9870</value>
    </property>
    <!-- SecondaryNameNode web UI address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop104:9868</value>
    </property>
    <!-- HDFS replication factor -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>ipc.maximum.data.length</name>
        <value>134217728</value>
    </property>
</configuration>
mapred-site.xml
<configuration>
    <!-- Run MapReduce on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <!-- JobHistory server address -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop102:10020</value>
    </property>
    <!-- JobHistory server web UI address -->
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop102:19888</value>
    </property>
</configuration>
yarn-site.xml
<configuration>
    <!-- Site specific YARN configuration properties -->
    <!-- Use the MapReduce shuffle auxiliary service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- ResourceManager address -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop103</value>
    </property>
    <!-- Environment variables inherited by containers -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
    <!-- Minimum and maximum memory a YARN container may be allocated -->
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1024</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>10240</value>
    </property>
    <!-- Physical memory managed by the NodeManager -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>40960</value>
    </property>
    <!-- Enable log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- Whether to check the physical memory each task uses and kill tasks that exceed their allocation (default true) -->
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <!-- Whether to check the virtual memory each task uses and kill tasks that exceed their allocation (default true) -->
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <!-- Log aggregation server URL -->
    <property>
        <name>yarn.log.server.url</name>
        <value>http://hadoop102:19888/jobhistory/logs</value>
    </property>
    <!-- Keep aggregated logs for 7 days -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>4096</value>
    </property>
</configuration>
workers
hadoop102
hadoop103
hadoop104
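Note: up to this point the Hadoop installation and the edited configuration files exist only on hadoop102. A minimal sketch for distributing them to the other nodes before formatting the NameNode, assuming the xsync script written earlier:
[sarah@hadoop102 module]$ xsync /opt/module/hadoop/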
[sarah@hadoop102 module]$ hdfs namenode -format
[sarah@hadoop102 sbin]$ start-dfs.sh
Check the HDFS processes with the following command
[sarah@hadoop102 ~]$ xcall
=============== hadoop102 ===============
17113 Jps
16623 NameNode
=============== hadoop103 ===============
11076 DataNode
11189 Jps
=============== hadoop104 ===============
11472 Jps
11349 SecondaryNameNode
11258 DataNode
=============== hadoop102 ===============
Last login: Thu Jan 20 17:31:56 2022 from hadoop102
[sarah@hadoop103 module]$ start-yarn.sh
Check whether the startup succeeded with the command
[sarah@hadoop102 ~]$ xcall
=============== hadoop102 ===============
18230 Jps
18077 NodeManager
16623 NameNode
=============== hadoop103 ===============
11076 DataNode
11926 NodeManager
11608 ResourceManager
12079 Jps
=============== hadoop104 ===============
12018 Jps
11349 SecondaryNameNode
11258 DataNode
11866 NodeManager
=============== hadoop102 ===============
Last login: Thu Jan 20 17:40:11 2022 from hadoop102
1. In the user's home directory, write a script to start and stop the Hadoop cluster.
[sarah@hadoop102 module]$ vim /home/sarah/bin/hadoop.sh
#!/bin/bash
if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit ;
fi
case $1 in
"start")
    echo " ====== starting the Hadoop cluster ======"
    echo " --------------- starting hdfs ---------------"
    ssh hadoop102 "/opt/module/hadoop/sbin/start-dfs.sh"
    echo " --------------- starting yarn ---------------"
    ssh hadoop103 "/opt/module/hadoop/sbin/start-yarn.sh"
    echo " --------------- starting historyserver ---------------"
    ssh hadoop102 "/opt/module/hadoop/bin/mapred --daemon start historyserver"
;;
"stop")
    echo " ====== stopping the Hadoop cluster ======"
    echo " --------------- stopping historyserver ---------------"
    ssh hadoop102 "/opt/module/hadoop/bin/mapred --daemon stop historyserver"
    echo " --------------- stopping yarn ---------------"
    ssh hadoop103 "/opt/module/hadoop/sbin/stop-yarn.sh"
    echo " --------------- stopping hdfs ---------------"
    ssh hadoop102 "/opt/module/hadoop/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac
2. Make the script executable
[sarah@hadoop102 bin]$ sudo chmod +x hadoop.sh
3. Stop the cluster with the script
[sarah@hadoop102 module]$ hadoop.sh stop
ZooKeeper version: apache-zookeeper-3.5.7-bin.tar.gz
1.2 ZooKeeper installation
[sarah@hadoop102 software]$ ll
total 529696
-rw-rw-r--. 1 sarah sarah 9311744 Jan 20 19:20 apache-zookeeper-3.5.7-bin.tar.gz
-rw-rw-r--. 1 sarah sarah 338075860 Jan 20 10:55 hadoop-3.1.3.tar.gz
-rw-rw-r--. 1 sarah sarah 195013152 Jan 20 10:40 jdk-8u212-linux-x64.tar.gz
[sarah@hadoop102 software]$ tar -zxvf apache-zookeeper-3.5.7-bin.tar.gz -C /opt/module/
[sarah@hadoop102 module]$ mv apache-zookeeper-3.5.7-bin/ zookeeper-3.5.7
[sarah@hadoop102 module]$ xsync zookeeper-3.5.7/
1.3 ZooKeeper configuration
[sarah@hadoop102 zookeeper-3.5.7]$ mkdir zkData
2) In the zkData directory, create a file named myid and write the number assigned to this server.
[sarah@hadoop102 zkData]$ vim myid
2
3) Distribute the configured ZooKeeper to the other machines and modify myid on each
[sarah@hadoop102 module]$ xsync zookeeper-3.5.7
[sarah@hadoop103 zkData]$ vim myid
3
[sarah@hadoop104 zkData]$ vim myid
4
[sarah@hadoop102 conf]$ mv zoo_sample.cfg zoo.cfg
2) Edit the zoo.cfg file and configure the following
[sarah@hadoop102 conf]$ vim zoo.cfg
……
dataDir=/opt/module/zookeeper-3.5.7/zkData
……
#######################cluster##########################
server.2=hadoop102:2888:3888
server.3=hadoop103:2888:3888
server.4=hadoop104:2888:3888
3) Sync the zoo.cfg configuration file
[sarah@hadoop102 zookeeper-3.5.7]$ xsync conf
4) Parameter explanation
server.A=B:C:D
A: a number indicating which server this is.
In cluster mode, the myid file under the zkData directory contains a single value, A. ZooKeeper reads this file at startup and compares the value with the configuration in zoo.cfg to determine which server it is.
B: the address of this server.
C: the port this server uses, as a follower, to exchange information with the cluster leader.
D: the port used for leader election. When the cluster leader goes down, this port is used to run a new election and choose a new leader.
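As a quick sanity check (a minimal sketch, assuming the xcall.sh script written earlier), confirm that each node's myid matches its server.A entry in zoo.cfg:
[sarah@hadoop102 ~]$ xcall.sh "cat /opt/module/zookeeper-3.5.7/zkData/myid"
The expected values are 2 on hadoop102, 3 on hadoop103, and 4 on hadoop104.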
1.4 Starting ZooKeeper
[sarah@hadoop102 zookeeper-3.5.7]$ bin/zkServer.sh start
[sarah@hadoop103 zookeeper-3.5.7]$ bin/zkServer.sh start
[sarah@hadoop104 zookeeper-3.5.7]$ bin/zkServer.sh start
[sarah@hadoop102 zookeeper-3.5.7]$ xcall
=============== hadoop102 ===============
30967 Jps
30845 QuorumPeerMain
=============== hadoop103 ===============
21008 Jps
20911 QuorumPeerMain
=============== hadoop104 ===============
21561 Jps
21471 QuorumPeerMain
=============== hadoop102 ===============
Last login: Fri Jan 21 11:37:22 2022 from hadoop102
[sarah@hadoop102 ~]$ xcall "/opt/module/zookeeper-3.5.7/bin/zkServer.sh stop"
1.5 Writing a start/stop script for the ZooKeeper cluster
[sarah@hadoop102 bin]$ vim zk.sh
#!/bin/bash
if [ $# -lt 1 ]
then
    echo "Usage: start|stop|status" && exit
fi
case $1 in
"start"){
    for i in hadoop102 hadoop103 hadoop104
    do
        echo =================== starting zookeeper on $i ===================
        ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh start"
    done
};;
"stop"){
    for i in hadoop102 hadoop103 hadoop104
    do
        echo ---------- stopping zookeeper on $i ------------
        ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh stop"
    done
};;
"status"){
    for i in hadoop102 hadoop103 hadoop104
    do
        echo ---------- zookeeper status on $i ------------
        ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh status"
    done
};;
*){
    echo "Usage: start|stop|status" && exit
};;
esac
[sarah@hadoop102 ~]$ zk.sh start
[sarah@hadoop102 ~]$ zk.sh status
[sarah@hadoop102 ~]$ zk.sh stop
[sarah@hadoop102 zookeeper-3.5.7]$ bin/zkCli.sh
Kafka version: kafka_2.12-3.0.0.tgz
[sarah@hadoop102 software]$ ll
-rw-rw-r--. 1 sarah sarah 86486610 Mar 10 12:33 kafka_2.12-3.0.0.tgz
[sarah@hadoop102 software]$ tar -zxvf kafka_2.12-3.0.0.tgz -C /opt/module/
[sarah@hadoop102 module]$ mv kafka_2.12-3.0.0 kafka
[sarah@hadoop102 kafka]$ cd config/
[sarah@hadoop102 config]$ vim server.properties
# Globally unique broker id; must be a number and must not repeat (must be customized per node)
broker.id=102
# Number of threads handling network requests
num.network.threads=3
# Number of threads handling disk I/O
num.io.threads=8
# Send buffer size of the socket
socket.send.buffer.bytes=102400
# Receive buffer size of the socket
socket.receive.buffer.bytes=102400
# Maximum size of a socket request
socket.request.max.bytes=104857600
# Path(s) where Kafka stores its logs (data); created automatically; multiple disk paths can be separated with "," (must be customized)
log.dirs=/opt/module/kafka/datas
# Number of partitions per topic on this broker
num.partitions=1
# Number of threads used to recover and clean data under the data dirs
num.recovery.threads.per.data.dir=1
# Default replication factor for topics (default 1)
offsets.topic.replication.factor=1
# Maximum time a segment file is retained before deletion
log.retention.hours=168
# Maximum size of each segment file, default 1G
log.segment.bytes=1073741824
# How often to check whether data has expired, default every 5 minutes
log.retention.check.interval.ms=300000
# ZooKeeper connection string (a /kafka chroot is created under the ZooKeeper root for easier management) (must be customized)
zookeeper.connect=hadoop102:2181,hadoop103:2181,hadoop104:2181/kafka
5. Configure environment variables
[sarah@hadoop102 kafka]$ sudo vim /etc/profile.d/my_env.sh
#KAFKA_HOME
export KAFKA_HOME=/opt/module/kafka
export PATH=$PATH:$KAFKA_HOME/bin
[sarah@hadoop102 kafka]$ source /etc/profile
Distributing my_env.sh as a normal user fails with Permission denied, so it must be synced with sudo:
[sarah@hadoop102 kafka]$ xsync /etc/profile.d/my_env.sh
==================== hadoop102 ====================
sending incremental file list

sent 47 bytes  received 12 bytes  39.33 bytes/sec
total size is 371  speedup is 6.29
==================== hadoop103 ====================
sending incremental file list
my_env.sh
rsync: mkstemp "/etc/profile.d/.my_env.sh.Sd7MUA" failed: Permission denied (13)

sent 465 bytes  received 126 bytes  394.00 bytes/sec
total size is 371  speedup is 0.63
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]
==================== hadoop104 ====================
sending incremental file list
my_env.sh
rsync: mkstemp "/etc/profile.d/.my_env.sh.vb8jRj" failed: Permission denied (13)

sent 465 bytes  received 126 bytes  1,182.00 bytes/sec
total size is 371  speedup is 0.63
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]
[sarah@hadoop102 kafka]$ sudo xsync /etc/profile.d/my_env.sh
sudo: xsync: command not found
[sarah@hadoop102 kafka]$ sudo cp /home/sarah/bin/xsync /usr/bin/
[sarah@hadoop102 kafka]$ sudo xsync /etc/profile.d/my_env.sh
[sarah@hadoop102 kafka]$ xcall source /etc/profile
[sarah@hadoop102 module]$ xsync kafka/
[sarah@hadoop103 kafka]$ vim config/server.properties
broker.id=103
[sarah@hadoop104 kafka]$ vim config/server.properties
broker.id=104
[sarah@hadoop102 kafka]$ zk.sh start
② Start Kafka on hadoop102, hadoop103, and hadoop104 in turn
[sarah@hadoop102 kafka]$ bin/kafka-server-start.sh -daemon config/server.properties
[sarah@hadoop103 kafka]$ bin/kafka-server-start.sh -daemon config/server.properties
[sarah@hadoop104 kafka]$ bin/kafka-server-start.sh -daemon config/server.properties
[sarah@hadoop102 kafka]$ bin/kafka-server-stop.sh
[sarah@hadoop103 kafka]$ bin/kafka-server-stop.sh
[sarah@hadoop104 kafka]$ bin/kafka-server-stop.sh
Create the Kafka cluster start/stop script kafka.sh in /home/sarah/bin:
[sarah@hadoop102 bin]$ vim kafka.sh
#!/bin/bash
if (($#==0)); then
    echo -e "Usage:\n start  start the Kafka cluster;\n stop   stop the Kafka cluster;\n" && exit
fi
case $1 in
"start")
    for host in hadoop103 hadoop102 hadoop104
    do
        echo "---------- $1 Kafka on $host ----------"
        ssh $host "/opt/module/kafka/bin/kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties"
    done
;;
"stop")
    for host in hadoop103 hadoop102 hadoop104
    do
        echo "---------- $1 Kafka on $host ----------"
        ssh $host "/opt/module/kafka/bin/kafka-server-stop.sh"
    done
;;
*)
    echo -e "---------- please enter a valid argument ----------\n"
    echo -e "start  start the Kafka cluster;\n stop   stop the Kafka cluster;\n" && exit
;;
esac
[sarah@hadoop102 bin]$ chmod +x kafka.sh
Note: when stopping the Kafka cluster, wait until every Kafka node process has fully stopped before stopping the ZooKeeper cluster. ZooKeeper stores the Kafka cluster's metadata; if ZooKeeper is stopped first, Kafka can no longer obtain the information it needs to shut down, and you are left killing the Kafka processes by hand.
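A minimal shutdown sketch that respects this ordering (assuming the kafka.sh, zk.sh, and xcall.sh scripts from this guide):
[sarah@hadoop102 ~]$ kafka.sh stop
[sarah@hadoop102 ~]$ xcall.sh jps    # repeat until no Kafka process is listed on any node
[sarah@hadoop102 ~]$ zk.sh stop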
3. Start and stop the cluster with the script
[sarah@hadoop102 kafka]$ kafka.sh start
[sarah@hadoop102 kafka]$ kafka.sh stop
Flume version: apache-flume-1.9.0-bin.tar.gz
[sarah@hadoop102 software]$ tar -zxf apache-flume-1.9.0-bin.tar.gz -C /opt/module/
[sarah@hadoop102 module]$ mv /opt/module/apache-flume-1.9.0-bin /opt/module/flume
[sarah@hadoop102 module]$ rm /opt/module/flume/lib/guava-11.0.2.jar
(guava-11.0.2.jar conflicts with the newer guava version bundled with Hadoop 3.x, so it is removed.)
Production experience: Flume heap memory tuning
In production, the Flume heap is usually set to 4 GB or more; configure it as follows (not configured in this VM environment):
[sarah@hadoop102 flume]$ vim conf/flume-env.sh
……
export JAVA_OPTS="-Xms4096m -Xmx4096m -Dcom.sun.management.jmxremote"
Notes:
① -Xms is the minimum (initial) JVM heap size, allocated at startup;
② -Xmx is the maximum JVM heap size allowed, allocated as needed.
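To verify that a running Flume agent actually picked up these settings (a minimal check; it assumes an agent has already been started on this node):
[sarah@hadoop102 flume]$ ps -ef | grep [f]lume
Check that -Xms4096m and -Xmx4096m appear in the java command line of the agent process.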
MySQL version: mysql-5.7.28-1.el7.x86_64.rpm-bundle.tar
2.2.1 Check the local environment
1. Check whether MySQL is already installed on the system; if it is, remove it with the commands below, otherwise skip this step.
[x hadoop102 module]$ rpm -qa|grep mariadb
mariadb-libs-5.5.56-2.el7.x86_64 // if present, uninstall it with the following command
[x hadoop102 module]$ sudo rpm -e --nodeps mariadb-libs // uninstall mariadb with this command
2.2.2 Upload and extract the MySQL installation package
1. Upload the MySQL installation package to /opt/software
[x hadoop102 software]$ ll
total 528384
-rw-r--r--. 1 root root 609556480 Mar 21 15:41 mysql-5.7.28-1.el7.x86_64.rpm-bundle.tar
2. Extract the MySQL installation package into the newly created mysql_jars directory under /opt/software
[x hadoop102 software]$ mkdir /opt/software/mysql_jars
[x hadoop102 software]$ tar -xf /opt/software/mysql-5.7.28-1.el7.x86_64.rpm-bundle.tar -C /opt/software/mysql_jars
3. The extracted files in the mysql_jars directory are as follows:
[x hadoop102 software]$ cd /opt/software/mysql_jars
[x hadoop102 mysql_jars]$ ll
total 595272
-rw-r--r--. 1 x x 45109364 Sep 30 2019 mysql-community-client-5.7.28-1.el7.x86_64.rpm
-rw-r--r--. 1 x x 318768 Sep 30 2019 mysql-community-common-5.7.28-1.el7.x86_64.rpm
-rw-r--r--. 1 x x 7037096 Sep 30 2019 mysql-community-devel-5.7.28-1.el7.x86_64.rpm
-rw-r--r--. 1 x x 49329100 Sep 30 2019 mysql-community-embedded-5.7.28-1.el7.x86_64.rpm
-rw-r--r--. 1 x x 23354908 Sep 30 2019 mysql-community-embedded-compat-5.7.28-1.el7.x86_64.rpm
-rw-r--r--. 1 x x 136837816 Sep 30 2019 mysql-community-embedded-devel-5.7.28-1.el7.x86_64.rpm
-rw-r--r--. 1 x x 4374364 Sep 30 2019 mysql-community-libs-5.7.28-1.el7.x86_64.rpm
-rw-r--r--. 1 x x 1353312 Sep 30 2019 mysql-community-libs-compat-5.7.28-1.el7.x86_64.rpm
-rw-r--r--. 1 x x 208694824 Sep 30 2019 mysql-community-server-5.7.28-1.el7.x86_64.rpm
-rw-r--r--. 1 x x 133129992 Sep 30 2019 mysql-community-test-5.7.28-1.el7.x86_64.rpm
2.2.3 Install MySQL
1. In /opt/software/mysql_jars, install the rpm packages strictly in the following order
sudo rpm -ivh mysql-community-common-5.7.28-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-libs-5.7.28-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-libs-compat-5.7.28-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-client-5.7.28-1.el7.x86_64.rpm
sudo rpm -ivh mysql-community-server-5.7.28-1.el7.x86_64.rpm
Note: a minimal Linux installation may run into problems here.
2. Delete everything under MySQL's data directory, i.e. the directory that datadir points to in /etc/my.cnf
If the directory already contains data:
· Check the value of datadir:
[mysqld]
datadir=/var/lib/mysql
· Delete everything under /var/lib/mysql:
[x hadoop102 hive]$ cd /var/lib/mysql
[x hadoop102 mysql]$ sudo rm -rf ./* // be very careful about the directory you run this command in
3. Initialize the database
[x hadoop102 module]$ sudo mysqld --initialize --user=mysql
4. After initialization, check the temporarily generated password for the root user; it is also the password for the first MySQL login
[x hadoop102 module]$ sudo cat /var/log/mysqld.log
2021-10-18T08:50:32.172049Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2021-10-18T08:50:32.364322Z 0 [Warning] InnoDB: New log files created, LSN=45790
2021-10-18T08:50:32.397350Z 0 [Warning] InnoDB: Creating foreign key constraint system tables.
2021-10-18T08:50:32.453522Z 0 [Warning] No existing UUID has been found, so we assume that this is the first time that this server has been started. Generating a new UUID: 73e2af3c-2ff0-11ec-af41-000c29830057.
2021-10-18T08:50:32.454765Z 0 [Warning] Gtid table is not ready to be used. Table 'mysql.gtid_executed' cannot be opened.
2021-10-18T08:50:32.978960Z 0 [Warning] CA certificate ca.pem is self signed.
2021-10-18T08:50:33.314317Z 1 [Note] A temporary password is generated for root@localhost: OU+*c.C9FZy;
5. Start the MySQL service
[x hadoop102 module]$ sudo systemctl start mysqld
6. Log in to MySQL
[x hadoop102 module]$ mysql -uroot -p
Enter password: (your temporary password) // enter the temporarily generated password
7. You must change the root password first; otherwise other operations will fail
mysql> set password = password("new password");
8. Modify the root entry in the mysql.user table to allow connections from any IP
mysql> update mysql.user set host='%' where user='root';
9. Flush privileges so the change takes effect
mysql> flush privileges;
10. Quit
mysql> quit;
Install and deploy Maxwell
Maxwell version: maxwell-1.29.2.tar.gz
[x hadoop102 software]$ ll
total 1406064
-rw-rw-r--. 1 x x 65510398 Mar 1 12:23 maxwell-1.29.2.tar.gz
[x hadoop102 software]$ tar -zxvf maxwell-1.29.2.tar.gz -C /opt/module/
1. Modify the MySQL configuration file /etc/my.cnf
[x hadoop102 software]$ sudo vim /etc/my.cnf
[mysqld]
# server id
server-id=1
# enable the binlog; this value is used as the binlog file name prefix
log-bin=mysql-bin
# binlog format; Maxwell requires row format
binlog_format=row
# database(s) to log in the binlog; adjust to your environment
binlog-do-db=gmall
[x hadoop102 software]$ sudo systemctl restart mysqld
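After the restart, it is worth confirming that the binlog settings took effect (a minimal check):
[x hadoop102 software]$ mysql -uroot -p -e "show variables like 'log_bin'; show variables like 'binlog_format';"
log_bin should be ON and binlog_format should be ROW.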
Create the database and user required by Maxwell
Maxwell stores the data it needs at runtime (e.g. the binlog sync position) in MySQL, so we need to create a database and a user for it in MySQL.
mysql> create database maxwell;
mysql> create user 'maxwell'@'%' identified by 'maxwell';
Query OK, 0 rows affected (0.00 sec)
mysql> set global validate_password_policy=0;
mysql> set global validate_password_length=4;
Error:
mysql> set global validate_password_policy=0;
ERROR 1193 (HY000): Unknown system variable 'validate_password_policy'
Solution:
sudo vim /etc/my.cnf
Add:
plugin-load-add=validate_password.so
validate-password=FORCE_PLUS_PERMANENT
The full /etc/my.cnf now looks like this:
[mysqld]
# server id
server-id=1
# enable the binlog; this value is used as the binlog file name prefix
log-bin=mysql-bin
# binlog format; Maxwell requires row format
binlog_format=row
# database(s) to log in the binlog; adjust to your environment
binlog-do-db=student
#
# Remove leading # and set to the amount of RAM for the most important data
# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
# innodb_buffer_pool_size = 128M
#
# Remove leading # to turn on a very important data integrity option: logging
# changes to the binary log between backups.
# log_bin
#
# Remove leading # to set options mainly useful for reporting servers.
# The server defaults are faster for transactions and fast SELECTs.
# Adjust sizes as needed, experiment to find the optimal values.
# join_buffer_size = 128M
# sort_buffer_size = 2M
# read_rnd_buffer_size = 2M
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
plugin-load-add=validate_password.so
validate-password=FORCE_PLUS_PERMANENT
Restart MySQL
[sarah@hadoop102 maxwell-1.29.2]$ sudo systemctl restart mysqld
mysql> grant all on maxwell.* to 'maxwell'@'%';
Query OK, 0 rows affected (0.00 sec)
mysql> grant select, replication client, replication slave on *.* to 'maxwell'@'%';
Query OK, 0 rows affected (0.00 sec)
[x hadoop102 maxwell-1.29.2]$ mv config.properties.example config.properties
[x hadoop102 maxwell-1.29.2]$ vim config.properties
# Maxwell output destination; options are stdout|file|kafka|kinesis|pubsub|sqs|rabbitmq|redis
producer=kafka
# Target Kafka cluster address
kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092
# Target Kafka topic; can be static, e.g. maxwell, or dynamic, e.g. %{database}_%{table}
kafka_topic=maxwell
# MySQL connection settings
host=hadoop102
user=maxwell
password=maxwell
jdbc_options=useSSL=false&serverTimezone=Asia/Shanghai
filter=exclude:*.*,include:db_core.tb_readonly_invoice_data
# Partition the Kafka producer output by table name
producer_partition_by=table
8. If Maxwell's output destination is a Kafka cluster, start the Kafka cluster first.
9. Start Maxwell
[x hadoop102 maxwell]$ bin/maxwell --config config.properties --daemon
[x hadoop102 maxwell]$ ps -ef | grep maxwell | grep -v grep | awk '{print $2}' | xargs kill -9
[x hadoop102 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic maxwell
12. Maxwell start/stop script
#!/bin/bash

MAXWELL_HOME=/opt/module/maxwell

status_maxwell(){
    result=`ps -ef | grep com.zendesk.maxwell.Maxwell | grep -v grep | wc -l`
    return $result
}

start_maxwell(){
    status_maxwell
    if [[ $? -lt 1 ]]; then
        echo "Starting Maxwell"
        $MAXWELL_HOME/bin/maxwell --config $MAXWELL_HOME/config.properties --daemon
    else
        echo "Maxwell is already running"
    fi
}

stop_maxwell(){
    status_maxwell
    if [[ $? -gt 0 ]]; then
        echo "Stopping Maxwell"
        ps -ef | grep com.zendesk.maxwell.Maxwell | grep -v grep | awk '{print $2}' | xargs kill -9
    else
        echo "Maxwell is not running"
    fi
}

case $1 in
start )
    start_maxwell
;;
stop )
    stop_maxwell
;;
restart )
    stop_maxwell
    start_maxwell
;;
esac
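Assuming the script is saved as /home/sarah/bin/maxwell.sh (the later steps call it by that name), make it executable and use it like the other cluster scripts:
[x hadoop102 bin]$ chmod +x maxwell.sh
[x hadoop102 bin]$ maxwell.sh start
[x hadoop102 bin]$ maxwell.sh stop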
13. Maxwell provides a bootstrap feature for full synchronization of historical data (a running Maxwell deployment is required). The command is as follows:
[x hadoop102 maxwell]$ bin/maxwell-bootstrap --database gmall --table user_info --config config.properties
[x hadoop102 maxwell]$ vim config.properties
log_level=info
producer=kafka
kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092
# Kafka topic
kafka_topic=topic_db
# mysql login info
host=hadoop102
user=maxwell
password=maxwell
jdbc_options=useSSL=false&serverTimezone=Asia/Shanghai
filter=exclude:*.*,include:db_core.tb_readonly_invoice_data
# Partition the Kafka producer output by table name
producer_partition_by=table
[x hadoop102 bin]$ maxwell.sh restart
[x hadoop102 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic topic_db
③ Generate mock data
[x hadoop102 db_log]$ java -jar gmall2020-mock-db-2021-11-14.jar
④ Check whether the Kafka consumer receives the data
{"database":"gmall","table":"cart_info","type":"update","ts":1592270938,"xid":13090,"xoffset":1573,"data":{"id":100924,"user_id":"93","sku_id":16,"cart_price":4488.00,"sku_num":1,"img_url":"http://47.93.148.192:8080/group1/M00/00/02/rBHu8l-sklaALrngAAHGDqdpFtU741.jpg","sku_name":"华为 HUAWEI P40 麒麟990 5G SoC芯片 5000万超感知徕卡三摄 30倍数字变焦 8GB+128GB亮黑色全网通5G手机","is_checked":null,"create_time":"2020-06-14 09:28:57","operate_time":null,"is_ordered":1,"order_time":"2021-10-17 09:28:58","source_type":"2401","source_id":null},"old":{"is_ordered":0,"order_time":null}}
DataX version: latest
[x hadoop102 software]$ ll
total 1406064
-rw-rw-r--. 1 atguigu atguigu 829372407 Feb 27 12:00 datax.tar.gz
[x hadoop102 software]$ tar -zxvf datax.tar.gz -C /opt/module
[x hadoop102 datax]$ python bin/datax.py job/job.json
Note: Hive 3.1.2 downloaded from the official site is not compatible with Spark 3.0.0 by default, because Hive 3.1.2 supports Spark 2.4.5; Hive 3.1.2 therefore needs to be recompiled.
Build steps: download the Hive 3.1.2 source from the official site and change the Spark version referenced in the pom file to 3.0.0. If it compiles, package it and take the jars; if it fails, fix the affected methods as the errors indicate until it builds, then package and take the jars.
Hive version: apache-hive-3.1.2-bin.tar.gz
1. Extract and install Hive
1) Upload apache-hive-3.1.2-bin.tar.gz to /opt/software on the Linux server
2) Extract apache-hive-3.1.2-bin.tar.gz from /opt/software/ into /opt/module/
[hadoop102 software]$ tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt/module/
3) Rename the extracted directory to hive
[hadoop102 module]$ mv apache-hive-3.1.2-bin/ /opt/module/hive
4) Edit /etc/profile.d/my_env.sh and add Hive's bin directory to the environment variables
[hadoop102 hive]$ sudo vim /etc/profile.d/my_env.sh
……
#HIVE_HOME
export HIVE_HOME=/opt/module/hive
export PATH=$PATH:$HIVE_HOME/bin
[x hadoop102 hive]$ source /etc/profile
3. Configure MySQL as the Hive metastore database
1) Copy the MySQL JDBC driver into Hive's lib directory
[x hadoop102 software]$ cp mysql-connector-java-5.1.37.jar /opt/module/hive/lib
2) Configure the metastore to use MySQL
Create a hive-site.xml file in /opt/module/hive/conf (settings in the new file override the defaults)
[x hadoop102 hive]$ vim conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- JDBC connection URL -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop102:3306/metastore?useSSL=false</value>
    </property>
    <!-- JDBC driver -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <!-- JDBC username -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <!-- JDBC password -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>your password</value>
    </property>
    <!-- Hive's default working directory on HDFS -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <!-- Metastore schema verification -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <!-- Metastore event DB notification API authorization -->
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
</configuration>
4. Initialize the Hive metastore database
Create the metastore database in MySQL for Hive's metadata, then let Hive's schema initialization create the tables.
1) Log in to MySQL
[x hadoop102 module]$ mysql -uroot -p<your password>
2) Create the Hive metastore database
mysql> create database metastore;
mysql> quit;
3) Initialize the Hive metastore schema
[x hadoop102 hive]$ bin/schematool -initSchema -dbType mysql -verbose
5. Start Hive
1) Start Hive
[x hadoop102 hive]$ bin/hive
2) Use Hive
hive> show databases; // list all databases
hive> show tables; // list all tables; does the test_derby table created earlier exist? Why?
hive> create table test_mysql (id int); // create the test_mysql table with a single int column, id
hive> insert into test_mysql values(1002); // insert a row into test_mysql
hive> select * from test_mysql; // query the test_mysql table
3) Open another window to test whether concurrent client access works
[x hadoop102 hive]$ bin/hive
hive> show tables;
hive> select * from test_mysql;
6. Change the metastore database character set
The Hive metastore database defaults to the Latin1 character set, which does not support Chinese characters, so Chinese comments in table DDL would be garbled.
1) Change the character set of the comment columns in the Hive metastore to utf8
① Column comments
mysql> alter table metastore.COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
② Table comments
mysql> alter table metastore.TABLE_PARAMS modify column PARAM_VALUE mediumtext character set utf8;
2) Modify the JDBC URL in hive-site.xml as follows (note that & must be escaped as &amp; inside XML)
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop102:3306/metastore?useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
</property>
The final contents of hive-site.xml:
<configuration>
    <!-- JDBC connection URL -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop102:3306/metastore?useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>9FPtZv7ibqIl</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>hadoop102</value>
    </property>
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
    <!-- Location of the Spark dependencies (note: port 8020 must match the NameNode port) -->
    <property>
        <name>spark.yarn.jars</name>
        <value>hdfs://hadoop102:8020/spark-jars/*</value>
    </property>
    <!-- Hive execution engine -->
    <property>
        <name>hive.execution.engine</name>
        <value>spark</value>
    </property>
    <!-- Metastore schema reader used for serialization/deserialization -->
    <property>
        <name>metastore.storage.schema.reader.impl</name>
        <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
    </property>
    <!-- Hive-to-Spark client connection timeout -->
    <property>
        <name>hive.spark.client.server.connect.timeout</name>
        <value>100000</value>
    </property>
</configuration>
Hive on Spark configuration
Spark version: spark-3.0.0-bin-hadoop3.2.tgz
(1) Compatibility notes
Note: Hive 3.1.2 downloaded from the official site is not compatible with Spark 3.0.0 by default, because Hive 3.1.2 supports Spark 2.4.5; Hive 3.1.2 therefore needs to be recompiled.
Build steps: download the Hive 3.1.2 source from the official site and change the Spark version referenced in the pom file to 3.0.0. If it compiles, package it and take the jars; if it fails, fix the affected methods as the errors indicate until it builds, then package and take the jars.
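A rough sketch of those build steps (the exact Maven flags are an assumption and may need adjusting to your environment):
[sarah@hadoop102 software]$ # download and unpack the apache-hive-3.1.2 source release from the official site
[sarah@hadoop102 software]$ cd apache-hive-3.1.2-src
[sarah@hadoop102 apache-hive-3.1.2-src]$ vim pom.xml    # change the spark.version property to 3.0.0
[sarah@hadoop102 apache-hive-3.1.2-src]$ mvn clean package -Pdist -DskipTests
If the build succeeds, the packaged tarball appears under packaging/target/.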
(2) Deploy Spark on the node where Hive is installed
If Spark is already deployed, this step can be skipped.
1) Spark download page:
http://spark.apache.org/downloads.html
2) Upload and extract spark-3.0.0-bin-hadoop3.2.tgz
[sarah@hadoop102 software]$ tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/module/
[sarah@hadoop102 software]$ mv /opt/module/spark-3.0.0-bin-hadoop3.2 /opt/module/spark
(3) Configure the SPARK_HOME environment variable
[sarah@hadoop102 software]$ sudo vim /etc/profile.d/my_env.sh
Add the following:
# SPARK_HOME
export SPARK_HOME=/opt/module/spark
export PATH=$PATH:$SPARK_HOME/bin
Source the file to make it take effect
[sarah@hadoop102 software]$ source /etc/profile.d/my_env.sh
(4) Create a Spark configuration file for Hive
[sarah@hadoop102 software]$ vim /opt/module/hive/conf/spark-defaults.conf
Add the following (jobs will run with these parameters):
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hadoop102:8020/spark-history
spark.executor.memory 1g
spark.driver.memory 1g
Create the following path on HDFS to store history logs
[sarah@hadoop102 software]$ hadoop fs -mkdir /spark-history
(5) Upload the pure (without-hadoop) Spark jars to HDFS
Note 1: the regular Spark 3.0.0 distribution bundles Hive 2.3.7 support by default, which causes compatibility problems with the installed Hive 3.1.2. We therefore use the pure Spark jars, which contain no Hadoop or Hive dependencies, to avoid conflicts.
Note 2: Hive jobs are ultimately executed by Spark, and Spark job resources are scheduled by YARN, so a job may be assigned to any node in the cluster. The Spark dependencies therefore need to be uploaded to an HDFS path that every node in the cluster can reach.
1) Upload and extract spark-3.0.0-bin-without-hadoop.tgz
[sarah@hadoop102 software]$ tar -zxvf /opt/software/spark-3.0.0-bin-without-hadoop.tgz
2) Upload the pure Spark jars to HDFS
[sarah@hadoop102 software]$ hadoop fs -mkdir /spark-jars
[sarah@hadoop102 software]$ hadoop fs -put spark-3.0.0-bin-without-hadoop/jars/* /spark-jars
(6) Modify the hive-site.xml file
[sarah@hadoop102 ~]$ vim /opt/module/hive/conf/hive-site.xml
Add the following:
<!-- Location of the Spark dependencies (note: port 8020 must match the NameNode port) -->
<property>
    <name>spark.yarn.jars</name>
    <value>hdfs://hadoop102:8020/spark-jars/*</value>
</property>
<!-- Hive execution engine -->
<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
</property>
<!-- Metastore schema reader used for serialization/deserialization -->
<property>
    <name>metastore.storage.schema.reader.impl</name>
    <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
</property>
3) Hive on Spark test
(1) Start the Hive client
[sarah@hadoop102 hive]$ bin/hive
(2) Create a test table
hive (default)> create table student(id int, name string);
(3) Test with an insert
hive (default)> insert into table student values(1,'abc');
If the insert job completes successfully, the configuration works.