Apache Griffin is an open-source big data data quality solution. It supports data quality checks in both batch and streaming mode and can measure data assets along multiple dimensions, improving the accuracy and trustworthiness of the data. Typical examples: after an offline job finishes, verify that the row counts at the source and the target match, or check the source table for null values.
Griffin has three main components, Define, Measure and Analyze, as shown in the figure below:
• Define: defines the dimensions of the data quality measurement, e.g. the time span to measure and the measurement target (whether source and target row counts match; the number of non-null values, the number of distinct values, the maximum, the minimum or the counts of the top-5 values of a given column in the data source; and so on)
• Measure: executes the measurement jobs and produces the results
• Analyze: stores and visualizes the results
Main features of Apache Griffin
• Measures: accuracy, completeness, timeliness, uniqueness, validity, consistency.
• Anomaly detection: detects data that does not meet expectations based on pre-defined rules and provides a download of the non-conforming records.
• Alerting: reports data quality issues by email or through the portal.
• Visual monitoring: dashboards show the current state of data quality.
• Real time: data quality checks can run in real time, so problems are discovered promptly.
• Extensibility: can be used to validate data across multiple data systems and warehouses.
• Scalability: works on very large data volumes; the eBay deployment currently handles about 1.2 PB.
• Self service: Griffin provides a clean, easy-to-use UI for managing data assets and data quality rules; users can view data quality results on the dashboard and customize what is displayed.
1. Register data: register the data sources whose quality you want to check with Griffin.
2. Configure the measure model: define the model along data quality dimensions such as accuracy, completeness, timeliness and uniqueness (a rough sketch of such a measure definition follows this list).
3. Configure a scheduled job that is submitted to the Spark cluster to check the data periodically.
4. View the metrics on the portal and analyze the data quality results.
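For orientation, the measure definition behind step 2 boils down to a JSON document that the measure module executes on Spark. The sketch below is modeled on Griffin's sample batch-accuracy configs; the exact field names can differ slightly between Griffin versions, and the database/table names (demo_src, demo_tgt) and rule columns are placeholders:

{
  "name": "accu_batch",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "src",
      "baseline": true,
      "connectors": [
        { "type": "hive", "version": "1.2", "config": { "database": "default", "table.name": "demo_src" } }
      ]
    },
    {
      "name": "tgt",
      "connectors": [
        { "type": "hive", "version": "1.2", "config": { "database": "default", "table.name": "demo_tgt" } }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "rule": "src.id = tgt.id AND src.age = tgt.age",
        "details": { "source": "src", "target": "tgt" }
      }
    ]
  },
  "sinks": ["CONSOLE", "ELASTICSEARCH"]
}

In the portal workflow you never write this by hand; the UI generates an equivalent definition from the measure you configure.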
Port name | Default port | Description |
---|---|---|
elasticsearch.http.port | 9200 | Elasticsearch web port |
kibana.server.port | 5601 | Kibana web port |
fs.defaultFS | 9820 | HDFS port (9000 in Hadoop 2.x) |
mapreduce.jobhistory.webapp.address | 19888 | JobHistory web port |
Hadoop HDFS NameNode HTTP UI | 9870 | NameNode web port |
livy.server.port | 8998 | Livy web port |
griffin.server.port | 8080 | Griffin web port |
1. JDK (1.8 or later)
2. PostgreSQL (version 10.4) or MySQL (version 5.6 or later)
3. Hadoop (2.6.0 or later)
4. Hive (version 2.x), installation reference: CentOS7 Hive 安装
5. Spark (version 2.2.1), installation reference: centos安装spark
6. Livy, installation reference: livy安装与部属
7. ElasticSearch (5.0 or later versions), installation reference: centos7下安装es5.*
Of all the big-data middleware I have used so far, Griffin probably has the largest number of dependencies.
Hadoop goes without saying, as the basic underlying layer for Griffin; MySQL/PostgreSQL stores the metadata; Hive is where Griffin keeps the source and target data as two tables when comparing them; ElasticSearch stores the results of the data quality analysis; Spark is where Griffin actually does its work, since every Griffin measurement is submitted as a Spark job; and because those Spark jobs have to be submitted somehow, Livy handles job submission and management.
As the hands-on example later shows, each measure job, together with the app it triggers according to its rules, results in one task submission, which appears in Livy as a session:
Measure jobs in Griffin and the apps they trigger
Sessions submitted to Spark and managed by Livy:
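Those submissions can also be inspected from the command line through Livy's REST API; listing the batch sessions is a single GET (host and port follow the deployment described below):

[zhouchen@hadoop202 ~]$ curl http://hadoop202:8998/batches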
Middleware | Version |
---|---|
CentOS | CentOS 7 |
Java | 1.8.0_121 |
Hadoop | hadoop-3.1.3 |
ElasticSearch | elasticsearch-6.3.1 |
Kibana | kibana-6.3.1 |
zookeeper | zookeeper-3.5.7 |
mysql | 5.7.28 |
hive | hive-3.1.2 |
scala | 2.11.8 |
spark | 2.4.5 |
Livy | 0.5.0 |
Maven | 3.5.4 |
griffin | 0.5.0 |
Apache Griffin itself focuses on data quality; it depends heavily on other big-data components:
Apache Hadoop: batch data source, stores metric data
Apache Hive: Hive Metastore
Apache Spark: computes batch and real-time metrics
Apache Livy: provides the RESTful API the service uses to call Apache Spark
MySQL: service metadata
ElasticSearch: stores metric data
Maven: project management tool used to package the Griffin project; Griffin is then run from the resulting jar files. In a production environment, build with Maven locally and copy the jars onto the platform, because Maven downloads many artifacts during the build and therefore needs internet access.
The CentOS 7 installation itself is omitted. Create the user/group zhouchen in advance.
Pre-install JDK 1.8.0_121 or later
Pre-install ZooKeeper
Pre-install Hadoop
Pre-install Hive
Pre-install Scala
Pre-install MySQL
For CentOS releases before 7:
1. Check the firewall status
service iptables status
2. Stop the firewall
service iptables stop
3. Start the firewall
service iptables start
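On CentOS 7 itself, firewalld replaces the iptables service, so the equivalent commands (run as root) are:

systemctl status firewalld
systemctl stop firewalld
systemctl disable firewalld    # keep it disabled across reboots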
1. Unpack the JDK 1.8 archive and configure the JAVA environment variables
[root@hadoop202 ~]# vi /etc/profile.d/jdk.sh
#JAVA_HOME
export JAVA_HOME=/opt/java/jdk1.8.0_121
export JRE_HOME=$JAVA_HOME/jre
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
2. After adding the environment variables, source the file so that they take effect
[root@hadoop202 ~]$ source /etc/profile.d/jdk.sh
3. Verify the installation
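A quick check with java -version should report the version that was just installed (remaining output lines omitted here):

[root@hadoop202 ~]# java -version
java version "1.8.0_121"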
1. Download the MySQL rpm packages to /opt/package
2. Install the MySQL server
[root@hadoop202 package]# rpm -ivh MySQL-server-5.6.24-1.el6.x86_64.rpm
Look up the initial password:
[root@hadoop202 mysql]# cat /root/.mysql_secret
Check the MySQL service status and start it
[root@hadoop202 package]# service mysqld status
## If this fails with "Unit mysqld.service could not be found.", locate mysql.server and copy it to /etc/init.d/mysqld
[root@hadoop202 mysql]# cp mysql.server /etc/init.d/mysqld
3. Install the MySQL client
[root@Hadoop202 package]# rpm -ivh MySQL-client-5.6.24-1.el6.x86_64.rpm
Log in to the client with the initial password obtained above
[root@Hadoop202 package]# mysql -uroot -pMabbdrdTaRv0Gywq
Change the password:
mysql> SET PASSWORD=PASSWORD('zhou59420');
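Because the Griffin service later connects through the hostname (jdbc:mysql://hadoop202:3306/...), the root account usually also needs a non-localhost grant. A minimal sketch, assuming root may connect from any host; tighten the host pattern and password to your own policy:

mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'zhou59420';
mysql> FLUSH PRIVILEGES;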
1. Unpack the archive
[zhouchen@hadoop202 software]$ tar -zxvf elasticsearch-6.3.1.tar.gz -C /opt/module
2. Edit the configuration file config/elasticsearch.yml and add the following:
# cluster name, host name, port, disable the bootstrap checks:
# name of the current node:
# nodes of the cluster:
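A minimal sketch of that block for this three-node layout (the cluster and node names are placeholders; the bootstrap settings are the ones commonly disabled on CentOS 7):

cluster.name: griffin-es
node.name: node-202
network.host: hadoop202
http.port: 9200
bootstrap.memory_lock: false
bootstrap.system_call_filter: false
discovery.zen.ping.unicast.hosts: ["hadoop202", "hadoop203", "hadoop204"]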
3. Distribute the elasticsearch-6.3.1 installation directory to the other two machines in the cluster
4. On each of the other nodes, adjust node.name and network.host in config/elasticsearch.yml
5. Start ES
[zhouchen@hadoop202 elasticsearch-6.3.1]$ bin/elasticsearch
6. Create the Griffin index
[zhouchen@hadoop202 elasticsearch-6.3.1]$ curl -H "Content-Type: application/json" -XPUT http://hadoop202:9200/griffin -d '
{
  "aliases": {},
  "mappings": {
    "accuracy": {
      "properties": {
        "name": {
          "fields": {
            "keyword": { "ignore_above": 256, "type": "keyword" }
          },
          "type": "text"
        },
        "tmst": { "type": "date" }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_replicas": "2",
      "number_of_shards": "5"
    }
  }
}'
[zhouchen@hadoop202 elasticsearch-6.3.1]$ sudo vim /etc/security/limits.conf
[zhouchen@hadoop202 elasticsearch-6.3.1]$ sudo vim /etc/security/limits.d/20-nproc.conf
[zhouchen@hadoop202 elasticsearch-6.3.1]$ sudo vim /etc/sysctl.conf
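The contents of these three files follow Elasticsearch's standard bootstrap requirements; a sketch, with the user name matching this setup and the commonly recommended minimum values:

# /etc/security/limits.conf — raise the open-file and process limits for the ES user
zhouchen soft nofile 65536
zhouchen hard nofile 131072
zhouchen soft nproc 4096
zhouchen hard nproc 4096

# /etc/security/limits.d/20-nproc.conf — make sure the default soft nproc limit is at least 4096
* soft nproc 4096

# /etc/sysctl.conf — Elasticsearch needs a larger mmap count
vm.max_map_count=262144

Apply the kernel setting with sudo sysctl -p; the limits take effect at the next login.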
1. Unpack the archive
[zhouchen@hadoop202 software]$ tar -zxvf kibana-6.3.1-linux-x86_64.tar.gz -C /opt/module
2. Configure conf/kibana.yml
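A minimal sketch of conf/kibana.yml for this layout (Kibana 6.x uses elasticsearch.url; later releases renamed the setting):

server.port: 5601
server.host: "hadoop202"
elasticsearch.url: "http://hadoop202:9200"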
1. Check the processes
2. Access the Elasticsearch UI
http://hadoop202:9200/_cat/nodes?v
3. Access the Kibana UI
http://hadoop202:5601/app/kibana#/management/elasticsearch/index_management/home?_g=()
Before installing Spark, a Scala environment is assumed to be installed already (version 2.11.8 here). I am installing Spark in cluster mode.
[zhouchen@hadoop202 bin]$ tar -zxvf spark-2.4.5-bin-hadoop2.7.tgz -C /opt/module/
[zhouchen@hadoop202 module]$ mv spark-2.4.5-bin-hadoop2.7/ spark-2.4.5
[zhouchen@hadoop102 conf]$ mv spark-defaults.conf.template spark-defaults.conf
[zhouchen@hadoop102 conf]$ vim spark-defaults.conf
# add the following configuration
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hadoop202:9820/spark_directory
spark.sql.autoBroadcastJoinThreshold 1
[zhouchen@hadoop102 spark]$ hadoop fs -mkdir /spark_directory
[zhouchen@hadoop102 conf]$ mv spark-env.sh.template spark-env.sh
[zhouchen@hadoop102 conf]$ vim spark-env.sh
# add the following parameters
YARN_CONF_DIR=/opt/module/hadoop-3.1.3/etc/hadoop
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080
-Dspark.history.retainedApplications=30
-Dspark.history.fs.logDirectory=hdfs://hadoop202:9820/spark_directory"
export JAVA_HOME=/opt/module/jdk1.8
export SCALA_HOME=/usr/local/scala/scala-2.11.8
export SPARK_MASTER_IP=hadoop202
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export HADOOP_CONF_DIR=/opt/module/hadoop-3.1.3/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/opt/module/hadoop-3.1.3/bin/hadoop classpath) # without this line an error is thrown
[zhouchen@hadoop202 hadoop]$ vim yarn-site.xml
# add the following content
<!-- Spark-related configuration -->
<!-- Whether to run a thread that checks the physical memory used by each task and kills the task if it exceeds its allocation; default is true -->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<!-- Whether to run a thread that checks the virtual memory used by each task and kills the task if it exceeds its allocation; default is true -->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
Copy the datanucleus-*.jar packages from Hive's /opt/module/hive-2.3.6/lib/ to Spark's /opt/module/spark/jars directory
[zhouchen@hadoop202 lib]$ cp /opt/module/hive/lib/datanucleus-*.jar /opt/module/spark-2.4.5/jars/
Copy hive-site.xml from Hive's /opt/module/hive-2.3.6/conf/ to Spark's /opt/module/spark/conf directory
[zhouchen@hadoop202 conf]$ cp /opt/module/hive/conf/hive-site.xml /opt/module/spark-2.4.5/conf/
[zhouchen@hadoop202 conf]$ mv slaves.template slaves
[zhouchen@hadoop202 conf]$ vim slaves
# the content after modification:
hadoop203
hadoop204
[zhouchen@hadoop202 module]$ xsync /opt/module/spark-2.4.5/
1. Start Spark in cluster mode:
[zhouchen@hadoop202 spark]$ bin/start-all.sh
2. Check the processes
[zhouchen@hadoop202 spark]$ xcall jps
3. Run the example job:
[zhouchen@hadoop202 spark]$ bin/run-example SparkPi 2>&1 | grep "Pi is roughly"
4. Check the result of the computation
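If everything is wired up correctly, the filtered output is a single line of the form below (the digits after 3.14 vary from run to run):

Pi is roughly 3.14...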
1. Unpack
[zhouchen@hadoop202 software]$ unzip incubator-livy-0.5.0-incubating.zip -d /opt/module/
2. Edit the configuration files
[zhouchen@hadoop202 conf]$ mv livy.conf.template livy.conf
[zhouchen@hadoop202 conf]$ mv livy-client.conf.template livy-client.conf
[zhouchen@hadoop202 conf]$ mv livy-env.sh.template livy-env.sh
[zhouchen@hadoop202 conf]$ mv spark-blacklist.conf.template spark-blacklist.conf
[zhouchen@hadoop202 conf]$ mv log4j.properties.template log4j.properties
[zhouchen@hadoop202 conf]$ vim livy-env.sh
# add the following content
export JAVA_HOME=/opt/module/jdk1.8
export SPARK_HOME=/opt/module/spark-2.4.5
export SPARK_CONF_DIR=$SPARK_HOME/conf
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
[zhouchen@hadoop202 conf]$ vim livy.conf
# add the following content
livy.server.host = hadoop202
livy.server.port = 8998
livy.spark.master = yarn
livy.repl.enableHiveContext = true
livy.spark.deploy-mode = client
3. Create the logs directory manually
[zhouchen@hadoop202 conf]$ mkdir /opt/module/livy-0.5.0/logs
4. Start Livy (run without the start argument, the script stays in the foreground and does not record a pid, so it cannot be stopped cleanly later; running it with start is recommended)
[zhouchen@hadoop202 livy-0.5.0]$ bin/livy-server start
5. Verify the installation
http://hadoop202:8998
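On a machine without a browser you can check it with curl instead; right after startup Livy's REST API should answer with an empty session list, something like:

[zhouchen@hadoop202 livy-0.5.0]$ curl http://hadoop202:8998/sessions
{"from":0,"total":0,"sessions":[]}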
1. Unpack the archive
[zhouchen@hadoop202 software]$ tar -zxvf apache-maven-3.5.4-bin.tar.gz -C /opt/module/
[zhouchen@hadoop202 module]$ mv apache-maven-3.5.4/ maven-3.5.4
2. Add the environment variables
[zhouchen@hadoop202 software]$ sudo vim /etc/profile.d/my_env.sh
# add the following content
#MAVEN_HOME
export MAVEN_HOME=/opt/module/maven-3.5.4
export PATH=$MAVEN_HOME/bin:$PATH
[zhouchen@hadoop202 software]$ source /etc/profile.d/my_env.sh
3. Configure the Aliyun repository mirror
[zhouchen@hadoop202 conf]$ vim settings.xml
# add the following content
<!-- Aliyun mirror -->
<mirror>
    <id>nexus-aliyun</id>
    <mirrorOf>central</mirrorOf>
    <name>Nexus aliyun</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
<mirror>
    <id>UK</id>
    <name>UK Central</name>
    <url>http://uk.maven.org/maven2</url>
    <mirrorOf>central</mirrorOf>
</mirror>
<mirror>
    <id>repo1</id>
    <mirrorOf>central</mirrorOf>
    <name>Human Readable Name for this Mirror.</name>
    <url>http://repo1.maven.org/maven2/</url>
</mirror>
<mirror>
    <id>repo2</id>
    <mirrorOf>central</mirrorOf>
    <name>Human Readable Name for this Mirror.</name>
    <url>http://repo2.maven.org/maven2/</url>
</mirror>
4. Create the .m2 directory
[zhouchen@hadoop202 zhouchen]$ mkdir -p /home/zhouchen/.m2
5. Test the installation
[zhouchen@hadoop202 module]$ mvn -v
Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-18T02:33:14+08:00)
Maven home: /opt/module/maven-3.5.4
Java version: 1.8.0_121, vendor: Oracle Corporation, runtime: /opt/module/jdk1.8/jre
Default locale: zh_CN, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-1127.el7.x86_64", arch: "amd64", family: "unix"
Write start/stop scripts for the cluster components
[zhouchen@hadoop202 bin]$ vi hadoop.sh
#!/bin/bash
case $1 in
"start"){
    echo " =================== starting the Hadoop cluster ==================="
    echo " --------------- starting hdfs ---------------"
    ssh hadoop202 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
    echo " --------------- starting yarn ---------------"
    ssh hadoop203 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
    echo " --------------- starting historyserver ---------------"
    ssh hadoop202 "cd /opt/module/hadoop-3.1.3/ ; mapred --daemon start historyserver"
};;
"stop"){
    echo " =================== stopping the Hadoop cluster ==================="
    echo " --------------- stopping historyserver ---------------"
    ssh hadoop202 "cd /opt/module/hadoop-3.1.3/ ; mapred --daemon stop historyserver"
    echo " --------------- stopping yarn ---------------"
    ssh hadoop203 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
    echo " --------------- stopping hdfs ---------------"
    ssh hadoop202 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
};;
*)
    echo "Invalid argument"
    echo " start  start the Hadoop services"
    echo " stop   stop the Hadoop services"
;;
esac
[zhouchen@hadoop202 bin]$ vi hiveservices.sh
#!/bin/bash
HIVE_LOG_DIR=$HIVE_HOME/logs
if [ ! -d $HIVE_LOG_DIR ]
then
    mkdir -p $HIVE_LOG_DIR
fi

# Check whether a process is running properly; $1 is the process name, $2 is its port
function check_process()
{
    pid=$(ps -ef 2>/dev/null | grep -v grep | grep -i $1 | awk '{print $2}')
    ppid=$(netstat -nltp 2>/dev/null | grep $2 | awk '{print $7}' | cut -d '/' -f 1)
    echo $pid
    [[ "$pid" =~ "$ppid" ]] && [ "$ppid" ] && return 0 || return 1
}

function hive_start()
{
    metapid=$(check_process HiveMetastore 9083)
    cmd="nohup hive --service metastore >$HIVE_LOG_DIR/metastore.log 2>&1 &"
    cmd=$cmd" sleep 4; hdfs dfsadmin -safemode wait >/dev/null 2>&1"
    [ -z "$metapid" ] && eval $cmd || echo "Metastore is already running"
    server2pid=$(check_process HiveServer2 10000)
    cmd="nohup hive --service hiveserver2 >$HIVE_LOG_DIR/hiveServer2.log 2>&1 &"
    [ -z "$server2pid" ] && eval $cmd || echo "HiveServer2 is already running"
}

function hive_stop()
{
    metapid=$(check_process HiveMetastore 9083)
    [ "$metapid" ] && kill $metapid || echo "Metastore is not running"
    server2pid=$(check_process HiveServer2 10000)
    [ "$server2pid" ] && kill $server2pid || echo "HiveServer2 is not running"
}

case $1 in
"start")
    hive_start
    ;;
"stop")
    hive_stop
    ;;
"restart")
    hive_stop
    sleep 2
    hive_start
    ;;
"status")
    check_process HiveMetastore 9083 >/dev/null && echo "Metastore is running normally" || echo "Metastore is not running properly"
    check_process HiveServer2 10000 >/dev/null && echo "HiveServer2 is running normally" || echo "HiveServer2 is not running properly"
    ;;
*)
    echo Invalid Args!
    echo 'Usage: '$(basename $0)' start|stop|restart|status'
    ;;
esac
[zhouchen@hadoop202 bin]$ vi zk.sh
#!/bin/bash
case $1 in
"start"){
    for i in hadoop202 hadoop203 hadoop204
    do
        ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh start"
    done
};;
"stop"){
    for i in hadoop202 hadoop203 hadoop204
    do
        ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh stop"
    done
};;
"status"){
    for i in hadoop202 hadoop203 hadoop204
    do
        ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh status"
    done
};;
esac
[zhouchen@hadoop202 bin]$ vi es.sh
#!/bin/bash
ELASTICSEARCH_HOME=/opt/module/elasticsearch-6.3.1
KIBANA_HOME=/opt/module/kibana-6.3.1
case $1 in
"start") {
    for i in hadoop202 hadoop203 hadoop204
    do
        ssh $i "source /etc/profile;${ELASTICSEARCH_HOME}/bin/elasticsearch >$ELASTICSEARCH_HOME/logs/es.log 2>$ELASTICSEARCH_HOME/logs/error.log &"
    done
    nohup ${KIBANA_HOME}/bin/kibana >$KIBANA_HOME/kibana.log 2>$KIBANA_HOME/error.log &
};;
"stop") {
    ps -ef|grep ${KIBANA_HOME} |grep -v grep|awk '{print $2}'|xargs kill
    for i in hadoop202 hadoop203 hadoop204
    do
        ssh $i "ps -ef|grep $ELASTICSEARCH_HOME |grep -v grep|awk '{print \$2}'|xargs kill" >/dev/null 2>&1
    done
};;
*){
    echo "Invalid argument: use start to launch the ES cluster or stop to shut it down"
};;
esac
[zhouchen@hadoop202 bin]$ vi livy.sh
#!/bin/bash
LIVY_HOME=/opt/module/livy-0.5.0
case $1 in
"start") {
ssh hadoop202 "${LIVY_HOME}/bin/livy-server start"
};;
"stop") {
ssh hadoop202 "${LIVY_HOME}/bin/livy-server stop"
};;
*){
echo "你启动的姿势不正确, 请使用参数 start 来启动, 使用参数 stop 来关闭"
};;
esac
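If these scripts live in a directory on the PATH such as ~/bin, remember to make them executable before using them:

[zhouchen@hadoop202 bin]$ chmod +x hadoop.sh hiveservices.sh zk.sh es.sh livy.sh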
[zhouchen@hadoop202 software]$ unzip griffin-master.zip -d /opt/module/
[zhouchen@hadoop202 griffin-master]$ mysql -u root -e "create database quartz" -p
The Quartz initialization script ships with the source at service/src/main/resources/Init_quartz_mysql_innodb.sql; import it into the new database:
[zhouchen@hadoop202 griffin-master]$ mysql -u root -p quartz < service/src/main/resources/Init_quartz_mysql_innodb.sql
1. Edit /opt/module/griffin-master/service/pom.xml
Comment out the org.postgresql dependency and add the MySQL dependency instead.
[zhouchen@hadoop202 service]$ vim pom.xml
<!--
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<version>${postgresql.version}</version>
</dependency>
-->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
</dependency>
## Note: omit the version number
2. Edit /opt/module/griffin-master/service/src/main/resources/application.properties
[zhouchen@hadoop202 service]$ vim /opt/module/griffin-master/service/src/main/resources/application.properties
# Apache Griffin application name
spring.application.name=griffin_service
# MySQL database configuration
spring.datasource.url=jdbc:mysql://hadoop202:3306/quartz?autoReconnect=true&useSSL=false
spring.datasource.username=root
spring.datasource.password=000000
spring.jpa.generate-ddl=true
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.show-sql=true
# Hive metastore configuration
hive.metastore.uris=thrift://hadoop202:9083
hive.metastore.dbname=default
# default is false, set explicitly here
hive.metastore.sasl.enabled=false
# empty by default
hive.metastore.kerberos.principal=
# default value
hive.metastore.kerberos.keytab.file=hive-metastore/_HOST@EXAMPLE.COM
hive.hmshandler.retry.attempts=15
hive.hmshandler.retry.interval=2000ms
# Hive cache time
cache.evict.hive.fixedRate.in.milliseconds=900000
# Kafka schema registry (configure if needed)
kafka.schema.registry.url=http://hadoop202:8081
# Update job instance state at regular intervals
jobInstance.fixedDelay.in.milliseconds=60000
# Expired time of job instance which is 7 days, i.e. 604800000 milliseconds. Time unit only supports milliseconds
jobInstance.expired.milliseconds=604800000
# schedule predicate job every 5 minutes and repeat 12 times at most
# interval time unit s:second m:minute h:hour d:day, only these four units are supported
predicate.job.interval=5m
predicate.job.repeat.count=12
# external properties directory location
external.config.location=
# external BATCH or STREAMING env
external.env.location=
# login strategy ("default" or "ldap")
login.strategy=default
# ldap
ldap.url=ldap://hostname:port
ldap.email=@example.com
ldap.searchBase=DC=org,DC=example
ldap.searchPattern=(sAMAccountName={0})
# hdfs default name
fs.defaultFS=
# elasticsearch
elasticsearch.host=hadoop202
elasticsearch.port=9200
elasticsearch.scheme=http
# elasticsearch.user = user
# elasticsearch.password = password
# livy
livy.uri=http://hadoop202:8998/batches
# yarn url
yarn.uri=http://hadoop203:8088
# griffin event listener
internal.event.listeners=GriffinJobEventHook
3. Edit /opt/module/griffin-master/service/src/main/resources/sparkProperties.json
[zhouchen@hadoop202 service]$ vim /opt/module/griffin-master/service/src/main/resources/sparkProperties.json
{
  "file": "hdfs://hadoop202:9820/griffin/griffin-measure.jar",
  "className": "org.apache.griffin.measure.Application",
  "name": "griffin",
  "queue": "default",
  "numExecutors": 2,
  "executorCores": 1,
  "driverMemory": "1g",
  "executorMemory": "1g",
  "conf": {
    "spark.yarn.dist.files": "hdfs://hadoop202:9820/home/spark_conf/hive-site.xml"
  },
  "files": []
}
4. Edit /opt/module/griffin-master/service/src/main/resources/env/env_batch.json
[zhouchen@hadoop202 service]$ vim /opt/module/griffin-master/service/src/main/resources/env/env_batch.json
{
  "spark": {
    "log.level": "INFO"
  },
  "sinks": [
    {
      "type": "CONSOLE",
      "config": {
        "max.log.lines": 10
      }
    },
    {
      "type": "HDFS",
      "config": {
        "path": "hdfs://hadoop202:9820/griffin/persist",
        "max.persist.lines": 10000,
        "max.lines.per.file": 10000
      }
    },
    {
      "type": "ELASTICSEARCH",
      "config": {
        "method": "post",
        "api": "http://hadoop202:9200/griffin/accuracy",
        "connection.timeout": "1m",
        "retry": 10
      }
    }
  ],
  "griffin.checkpoint": []
}
5. Edit /opt/module/griffin-master/service/src/main/resources/env/env_streaming.json
[zhouchen@hadoop202 service]$ vim /opt/module/griffin-master/service/src/main/resources/env/env_streaming.json
{
  "spark": {
    "log.level": "WARN",
    "checkpoint.dir": "hdfs://hadoop202:9820/griffin/checkpoint/${JOB_NAME}",
    "init.clear": true,
    "batch.interval": "1m",
    "process.interval": "5m",
    "config": {
      "spark.default.parallelism": 4,
      "spark.task.maxFailures": 5,
      "spark.streaming.kafkaMaxRatePerPartition": 1000,
      "spark.streaming.concurrentJobs": 4,
      "spark.yarn.maxAppAttempts": 5,
      "spark.yarn.am.attemptFailuresValidityInterval": "1h",
      "spark.yarn.max.executor.failures": 120,
      "spark.yarn.executor.failuresValidityInterval": "1h",
      "spark.hadoop.fs.hdfs.impl.disable.cache": true
    }
  },
  "sinks": [
    {
      "type": "CONSOLE",
      "config": {
        "max.log.lines": 100
      }
    },
    {
      "type": "HDFS",
      "config": {
        "path": "hdfs://hadoop202:9820/griffin/persist",
        "max.persist.lines": 10000,
        "max.lines.per.file": 10000
      }
    },
    {
      "type": "ELASTICSEARCH",
      "config": {
        "method": "post",
        "api": "http://hadoop202:9200/griffin/accuracy"
      }
    }
  ],
  "griffin.checkpoint": [
    {
      "type": "zk",
      "config": {
        "hosts": "zk:2181",
        "namespace": "griffin/infocache",
        "lock.path": "lock",
        "mode": "persist",
        "init.clear": true,
        "close.clear": false
      }
    }
  ]
}
1. Start Hadoop
[zhouchen@hadoop202 bin]$ bash hadoop.sh start
2. Start ZooKeeper
[zhouchen@hadoop202 bin]$ bash zk.sh start
3. Start Hive
[zhouchen@hadoop202 bin]$ bash hiveservices.sh start
4. Start ElasticSearch
[zhouchen@hadoop202 bin]$ bash es.sh start
5. Start Livy
[zhouchen@hadoop202 bin]$ bash livy.sh start
6. Start Spark
[zhouchen@hadoop202 bin]$ /opt/module/spark-2.4.5/sbin/start-all.sh
7. Run Maven in /opt/module/griffin-master to compile the Griffin source code
[zhouchen@hadoop202 griffin-master]$ mvn -Dmaven.test.skip=true clean install
8. Compilation finished
When the command completes, the service and measure modules each produce a jar under their target directories: service-0.6.0-SNAPSHOT.jar and measure-0.6.0-SNAPSHOT.jar.
1. Rename /opt/module/griffin-master/measure/target/measure-0.6.0-SNAPSHOT.jar
[zhouchen@hadoop202 measure]$ mv measure-0.6.0-SNAPSHOT.jar griffin-measure.jar
2. Upload griffin-measure.jar to HDFS
[zhouchen@hadoop202 measure]$ hadoop fs -mkdir /griffin/
[zhouchen@hadoop202 measure]$ hadoop fs -put griffin-measure.jar /griffin/
Note: this is necessary because when Spark runs the job on the YARN cluster it loads griffin-measure.jar from the /griffin directory on HDFS; otherwise it fails with a class-not-found error for org.apache.griffin.measure.Application.
3. Upload hive-site.xml to /home/spark_conf/ on HDFS
[zhouchen@hadoop202 ~]$ hadoop fs -mkdir -p /home/spark_conf/
[zhouchen@hadoop202 ~]$ hadoop fs -put /opt/module/hive-2.3.6/conf/hive-site.xml /home/spark_conf/
4. Go to /opt/module/griffin-master/service/target/ and run service-0.6.0-SNAPSHOT.jar
Start in the console (log output is printed to the console):
[zhouchen@hadoop202 target]$ java -jar service-0.6.0-SNAPSHOT.jar
5. Or start in the background and write the logs to service.out:
[zhouchen@hadoop202 ~]$ nohup java -jar service-0.6.0-SNAPSHOT.jar>service.out 2>&1 &
Open http://hadoop202:8080 — by default the account and password are both empty.