Cluster service plan:
| Component | bigdata-001 | bigdata-002 | bigdata-003 | bigdata-004 | bigdata-005 |
| --- | --- | --- | --- | --- | --- |
| MySQL-8.0.31 | mysql | | | | |
| Datax | Datax | Datax | Datax | Datax | Datax |
| Spark-3.3.1 | Spark | Spark | Spark | Spark | Spark |
| Hive-3.1.3 | Hive | Hive | | | |
Hive official site: https://hive.apache.org/
Hive download archive: http://archive.apache.org/dist/hive/
Spark official site: https://spark.apache.org/
Spark download: https://www.apache.org/dyn/closer.lua/spark/spark-3.3.1/
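For reference, the exact tarballs used below can also be pulled straight from the Apache archive (paths assume the archive's current layout; adjust if they move):
wget https://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
wget https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
wget https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-without-hadoop.tgz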
Note: Hive 3.1.3 and Spark 3.3.1 as downloaded from the official sites are not compatible with each other out of the box, because the Spark version Hive 3.1.3 was built against is 2.4.5. Hive 3.1.3 therefore has to be recompiled against Spark 3.3.1.
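A minimal sketch of that recompile, assuming you build the apache/hive source at the rel/release-3.1.3 tag and bump the spark.version (and matching scala.binary.version) properties in the root pom.xml to match Spark 3.3.1; in practice a few source files usually also need patching for Spark API changes, so treat this as an outline rather than a guaranteed recipe:
# assumption: the build host has JDK 8 and Maven 3.x installed
git clone https://github.com/apache/hive.git
cd hive
git checkout rel/release-3.1.3
# edit pom.xml: set <spark.version> to 3.3.1 and the matching <scala.binary.version>, then build the distribution
mvn clean package -Pdist -DskipTests -Dmaven.javadoc.skip=true
# the rebuilt apache-hive-3.1.3-bin.tar.gz should end up under packaging/target/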
For the Hadoop environment itself, see this blog's post 最全Hadoop实际生产集群高可用搭建 (a complete guide to building a highly available production Hadoop cluster).
[hadoop@hadoop1 software]$ tar -zxvf /data/software/apache-hive-3.1.3-bin.tar.gz -C /data/module/
[hadoop@hadoop1 software]$ mv /data/module/apache-hive-3.1.3-bin/ /data/module/hive-3.1.3
[hadoop@hadoop1 software]$ sudo vim /etc/profile.d/my_env.sh
#HIVE_HOME
export HIVE_HOME=/data/module/hive-3.1.3
export PATH=$PATH:$HIVE_HOME/bin
export PATH JAVA_HOME HADOOP_HOME HIVE_HOME
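After saving the file, it is worth sourcing it and confirming the new variables are picked up; hive --version works at this point because it does not touch the metastore:
[hadoop@hadoop1 software]$ source /etc/profile.d/my_env.sh
[hadoop@hadoop1 software]$ hive --version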
[hadoop@hadoop1 software]$ cp /data/software/mysql-connector-java-5.1.48.jar $HIVE_HOME/lib
[hadoop@hadoop1 software]$ vim $HIVE_HOME/conf/hive-site.xml
Add the following content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- JDBC connection URL -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://xxx:3306/metastore?useSSL=false&amp;createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
    </property>
    <!-- JDBC connection driver -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <!-- JDBC connection username -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>xxx</value>
    </property>
    <!-- JDBC connection password -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>xxx</value>
    </property>
    <!-- Default Hive warehouse directory on HDFS -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <!-- Metastore schema verification -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <!-- Workaround for table metadata that otherwise cannot be read -->
    <property>
        <name>metastore.storage.schema.reader.impl</name>
        <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
    </property>
    <!-- Metastore event DB notification authorization -->
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    <!-- Print the current database and column headers in the CLI -->
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
    <!-- Metastore service addresses to connect to -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://xxx:9083,thrift://xxx1:9083</value>
    </property>
    <!-- Host that HiveServer2 binds to -->
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>xxx</value>
    </property>
    <!-- Port that HiveServer2 listens on -->
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
    <property>
        <name>hive.server2.enable.doAs</name>
        <value>false</value>
    </property>
    <!-- Location of the Spark jars (note: port 8020 must match the NameNode port) -->
    <property>
        <name>spark.yarn.jars</name>
        <value>hdfs://hadoopcluster/spark-jars/*</value>
    </property>
    <!-- Hive execution engine -->
    <property>
        <name>hive.execution.engine</name>
        <value>spark</value>
    </property>
    <!-- Dynamic allocation of Spark resources -->
    <property>
        <name>spark.dynamicAllocation.enabled</name>
        <value>true</value>
    </property>
    <!-- Hive-to-Spark client connect timeout -->
    <property>
        <name>hive.spark.client.connect.timeout</name>
        <value>100000ms</value>
    </property>
    <property>
        <name>hive.zookeeper.client.port</name>
        <value>2181</value>
    </property>
    <property>
        <name>hive.zookeeper.quorum</name>
        <value>xxxxx</value>
    </property>
    <property>
        <name>hive.server2.support.dynamic.service.discovery</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.server2.zookeeper.namespace</name>
        <value>hiveserver2_zk</value>
    </property>
    <!--
    <property>
        <name>hive.exec.post.hooks</name>
        <value>org.apache.atlas.hive.hook.HiveHook</value>
    </property>
    -->
    <!-- Wait time between HiveServer2 start attempts -->
    <property>
        <name>hive.server2.sleep.interval.between.start.attempts</name>
        <value>2s</value>
        <description>
            Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec),
            which is msec if not specified. The time should be in between 0 msec (inclusive) and
            9223372036854775807 msec (inclusive). Amount of time to sleep between HiveServer2 start attempts.
            Primarily meant for tests
        </description>
    </property>
    <!-- Do not show operation INFO logs -->
    <property>
        <name>hive.server2.logging.operation.enabled</name>
        <value>false</value>
    </property>
    <!--
    <property>
        <name>hive.tez.container.size</name>
        <value>10240</value>
    </property>
    <property>
        <name>hive.server2.enable.doAs</name>
        <value>true</value>
    </property>
    -->
    <property>
        <name>hive_timeline_logging_enabled</name>
        <value>true</value>
    </property>
    <!-- Hooks that ship data to the Tez UI (disabled) -->
    <!--
    <property>
        <name>hive.exec.failure.hooks</name>
        <value>org.apache.hadoop.hive.ql.hooks.ATSHook</value>
    </property>
    <property>
        <name>hive.exec.post.hooks</name>
        <value>org.apache.hadoop.hive.ql.hooks.ATSHook</value>
    </property>
    <property>
        <name>hive.exec.pre.hooks</name>
        <value>org.apache.hadoop.hive.ql.hooks.ATSHook</value>
    </property>
    -->
    <property>
        <name>hive.reloadable.aux.jars.path</name>
        <value>/data/module/hive-3.1.3/jars</value>
    </property>
    <!-- HiveServer2 custom password authentication (disabled) -->
    <!--
    <property>
        <name>hive.security.authorization.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.server2.authentication</name>
        <value>CUSTOM</value>
    </property>
    -->
    <!-- Hive admin (superuser) -->
    <property>
        <name>hive.users.in.admin.role</name>
        <value>hadoop</value>
    </property>
</configuration>
[hadoop@hadoop1 software]$ mysql -uroot -pxxx
mysql> create database metastore;
mysql> quit;
[hadoop@hadoop1 software]$ schematool -initSchema -dbType mysql -verbose
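If the initialization succeeded, the metastore database now holds Hive's schema tables (DBS, TBLS, COLUMNS_V2, and so on); a quick sanity check:
[hadoop@hadoop1 software]$ mysql -uroot -pxxx -e "use metastore; show tables;" | head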
4) Change the metastore database character set
The Hive metastore database defaults to the latin1 character set, which cannot store Chinese characters, so Chinese comments in table definitions show up garbled. To fix this, make the following changes.
Change the character set of the comment-storing columns in the Hive metastore database to utf8:
mysql> use metastore;
-- Column comments
mysql> alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
-- Table comments
mysql> alter table TABLE_PARAMS modify column PARAM_VALUE mediumtext character set utf8;
-- Exit (partition and index comments are covered in the note below)
mysql> quit;
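If partition or index comments may also contain Chinese, the statements below are commonly applied as well (run them in the same metastore session, before quitting); the column types shown assume the stock 3.1.x metastore schema, so verify them against your installation first:
mysql> alter table PARTITION_PARAMS modify column PARAM_VALUE mediumtext character set utf8;
mysql> alter table PARTITION_KEYS modify column PKEY_COMMENT varchar(4000) character set utf8;
mysql> alter table INDEX_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;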
[hadoop@hadoop1 hive]$ bin/hive --service hiveserver2
[hadoop@hadoop1 hive]$ bin/beeline -u jdbc:hive2://hadoop1:10000 -n hadoop
Connecting to jdbc:hive2://hadoop1:10000
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive
0: jdbc:hive2://hadoop1:10000>
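A simple statement at the prompt confirms the connection is healthy; on a fresh install it should return only the default database:
0: jdbc:hive2://hadoop1:10000> show databases;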
Upload spark-3.3.1-bin-hadoop3.tgz to the Linux host and extract it to the target location; the path must not contain Chinese characters or spaces.
tar -zxvf spark-3.3.1-bin-hadoop3.tgz -C /data/module
cd /data/module
mv spark-3.3.1-bin-hadoop3 spark-3.3.1
1) Enter the extracted directory and run the following command:
bin/spark-shell
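Reaching the scala> prompt means the local install works; a slightly stronger smoke test is the bundled SparkPi example (exit spark-shell first with :quit):
bin/run-example SparkPi 10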
[hadoop@hadoop1 software]$ sudo vim /etc/profile.d/my_env.sh
Add the following content:
# SPARK_HOME
export SPARK_HOME=/data/module/spark-3.3.1
export PATH=$PATH:$SPARK_HOME/bin
source the file so it takes effect:
[hadoop@hadoop1 software]$ source /etc/profile.d/my_env.sh
[hadoop@hadoop1 software]$ vim /data/module/spark-3.3.1/conf/spark-defaults.conf
Add the following content (jobs submitted later will run with these parameters):
spark.master  yarn
spark.eventLog.enabled  true
spark.eventLog.dir  hdfs://yourhadoopcluster/spark-history
spark.executor.cores  1
spark.executor.memory  4g
spark.executor.memoryOverhead  2g
spark.driver.memory  4g
spark.driver.memoryOverhead  2g
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled  true
spark.dynamicAllocation.executorIdleTimeout  60s
spark.dynamicAllocation.initialExecutors  1
spark.dynamicAllocation.minExecutors  1
spark.dynamicAllocation.maxExecutors  12
spark.dynamicAllocation.schedulerBacklogTimeout  1s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout  5s
spark.dynamicAllocation.cachedExecutorIdleTimeout  30s
spark.shuffle.useOldFetchProtocol  true
spark.history.fs.cleaner.enabled  true
spark.history.fs.cleaner.interval  1d
spark.history.fs.cleaner.maxAge  7d
spark.hadoop.orc.overwrite.output.file  true
spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8
spark.driver.extraJavaOptions=-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8
Create the following path on HDFS to store the history logs:
[hadoop@hadoop1 software]$ hadoop fs -mkdir /spark-history
Note 1: The regular (non-pure) Spark 3.3.1 distribution bundles Hive 2.3.x jars by default, so using it directly causes compatibility problems with the Hive 3.1.3 we installed. We therefore use the pure ("without-hadoop") Spark jars, which carry no Hadoop or Hive dependencies, to avoid the conflict.
Note 2: Hive jobs are ultimately executed by Spark, and Spark job resources are scheduled by YARN, so a job can be assigned to any node in the cluster. The Spark dependencies therefore have to be uploaded to an HDFS path that every node can read.
① Upload and extract spark-3.3.1-bin-without-hadoop.tgz
[hadoop@hadoop1 software]$ tar -zxvf /data/software/spark-3.3.1-bin-without-hadoop.tgz
② Upload the pure Spark jars to HDFS
[hadoop@hadoop1 software]$ hadoop fs -mkdir /spark-jars
[hadoop@hadoop1 software]$ hadoop fs -put spark-3.3.1-bin-without-hadoop/jars/* /spark-jars
6) Copy Spark's YARN shuffle service jar into Hadoop's YARN lib directory
cp /data/module/spark-3.3.1/yarn/spark-3.3.1-yarn-shuffle.jar /data/module/hadoop-3.3.4/share/hadoop/yarn/lib/
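Because spark-defaults.conf above sets spark.shuffle.service.enabled to true, copying the jar alone is not enough: each NodeManager also has to load the shuffle service as an auxiliary service. A sketch of the corresponding yarn-site.xml entries, assuming the default mapreduce_shuffle service is kept (restart the NodeManagers afterwards):
<!-- Register Spark's external shuffle service alongside the MapReduce one -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>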
[hadoop@hadoop1 ~]$ vim /data/module/hive-3.1.3/conf/hive-site.xml
Add the following content:
<!-- Location of the Spark jars (note: port 8020 must match the NameNode port) -->
<property>
<name>spark.yarn.jars</name>
<value>hdfs://xxx:8020/spark-jars/*</value>
</property>
<!-- Hive execution engine -->
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
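With the engine switched to Spark, a quick end-to-end check is to run a statement that actually launches a Spark job on YARN; the table name below is just an example, and the first statement is slow because the Spark session has to start:
hive (default)> create table hos_test(id int, name string);
hive (default)> insert into hos_test values (1, 'hive_on_spark');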
Download the Spark connector source from GitHub: https://github.com/apache/doris-spark-connector
Build a connector jar matched to your versions as described in the README.
Copy the jar into Spark's jars directory, and also upload a copy to the Spark jars directory on HDFS.
cp /your_path/spark-doris-connector/target/spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar $SPARK_HOME/jars
hadoop fs -put /your_path/spark-doris-connector/target/spark-doris-connector-3.1_2.12-1.0.0-SNAPSHOT.jar /spark-jars
Run spark-sql to test:
-- Test: copy data from one Doris table to another through Spark
CREATE TEMPORARY VIEW spark_doris1
USING doris
OPTIONS(
  'table.identifier'='demo.t1',
  'fenodes'='xxx:8030',
  'user'='xxx',
  'password'='xxx'
);

CREATE TEMPORARY VIEW spark_doris2
USING doris
OPTIONS(
  'table.identifier'='demo.t2',
  'fenodes'='xxx:8030',
  'user'='xxx',
  'password'='xxx'
);

INSERT INTO spark_doris1 SELECT * FROM spark_doris2;