Where HBase fits: single tables beyond tens of millions, even billions, of rows, under high concurrency.
Where HBase does not fit: workloads whose main requirement is data analysis, such as reporting; or small data volumes with demanding real-time requirements.
There are many query tools for HBase, such as Hive, Tez, Impala, Spark SQL, Kylin, and Phoenix.
① Make sure JAVA_HOME is set in the environment variables of the Linux server where HBase will be installed.
② HBase runs on top of Hadoop, so also make sure HADOOP_HOME is set in the environment variables of that server.
Extract HBase into the target directory on the hadoop102 server:
[whx@hadoop102 software]$ tar -zxvf hbase-1.3.1-bin.tar.gz -C /opt/module
Rename the /opt/module/hbase-1.3.1 directory to /opt/module/hbase:
[whx@hadoop102 module]$ ll
total 28
drwxrwxr-x.  9 whx whx 4096 Jan 31 14:45 flume
drwxr-xr-x. 11 whx whx 4096 Jan 31 10:43 hadoop-2.7.2
drwxrwxr-x.  7 whx whx 4096 Feb  2 10:11 hbase-1.3.1
drwxrwxr-x.  9 whx whx 4096 Jan 30 19:27 hive
drwxr-xr-x.  8 whx whx 4096 Dec 13  2016 jdk1.8.0_121
drwxr-xr-x.  8 whx whx 4096 Feb  1 16:32 kafka
drwxr-xr-x. 11 whx whx 4096 Jan 29 22:01 zookeeper-3.4.10
[whx@hadoop102 module]$ mv hbase-1.3.1/ hbase
[whx@hadoop102 module]$ ll
total 28
drwxrwxr-x.  9 whx whx 4096 Jan 31 14:45 flume
drwxr-xr-x. 11 whx whx 4096 Jan 31 10:43 hadoop-2.7.2
drwxrwxr-x.  7 whx whx 4096 Feb  2 10:11 hbase
drwxrwxr-x.  9 whx whx 4096 Jan 30 19:27 hive
drwxr-xr-x.  8 whx whx 4096 Dec 13  2016 jdk1.8.0_121
drwxr-xr-x.  8 whx whx 4096 Feb  1 16:32 kafka
drwxr-xr-x. 11 whx whx 4096 Jan 29 22:01 zookeeper-3.4.10
[whx@hadoop102 module]$
Set JAVA_HOME in conf/hbase-env.sh (this can be left unchanged, because /etc/profile already exports JAVA_HOME globally):
export JAVA_HOME=/opt/module/jdk1.8.0_121
Comment out the PermSize options (we are on JDK 1.8+, which no longer needs them):
# Configure PermSize. Only needed in JDK7. You can safely remove it for JDK8+
#export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"
#export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"
Use an external Zookeeper rather than the one bundled with HBase. By default, HBase manages its own Zookeeper (HBASE_MANAGES_ZK=true):
# Tell HBase whether it should manage it's own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=false
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Shared directory used by every regionserver to persist HBase data; defaults to /tmp/hbase -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop101:9000/HBase</value>
  </property>
  <!-- Cluster mode: false means standalone, true means distributed -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- Address of the external Zookeeper ensemble HBase depends on -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop101:2181,hadoop102:2181,hadoop103:2181</value>
  </property>
  <!-- Data directory of the external Zookeeper on each node -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/module/zookeeper-3.4.10/datas</value>
  </property>
</configuration>
Add the HBase environment variables to /etc/profile:
[whx@hadoop102 ~]$ sudo vim /etc/profile
JAVA_HOME=/opt/module/jdk1.8.0_121
HADOOP_HOME=/opt/module/hadoop-2.7.2
HIVE_HOME=/opt/module/hive
FLUME_HOME=/opt/module/flume
HBASE_HOME=/opt/module/hbase
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$FLUME_HOME/bin:$HBASE_HOME/bin
export JAVA_HOME HADOOP_HOME HIVE_HOME FLUME_HOME HBASE_HOME PATH
[whx@hadoop102 ~]$ source /etc/profile
Distribute the hbase directory and the updated /etc/profile to the other nodes:
[whx@hadoop102 module]$ xsync.sh hbase/
[whx@hadoop102 ~]$ xsync.sh /etc/profile
Start Zookeeper:
[whx@hadoop102 ~]$ xcall.sh /opt/module/zookeeper-3.4.10/bin/zkServer.sh start
Command to execute: /opt/module/zookeeper-3.4.10/bin/zkServer.sh start
----------------------------hadoop101----------------------------------
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.4.10/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
----------------------------hadoop102----------------------------------
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.4.10/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
----------------------------hadoop103----------------------------------
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.4.10/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[whx@hadoop102 ~]$ xcall.sh jps
Command to execute: jps
----------------------------hadoop101----------------------------------
3037 QuorumPeerMain
3071 Jps
----------------------------hadoop102----------------------------------
3633 QuorumPeerMain
3677 Jps
----------------------------hadoop103----------------------------------
3072 Jps
3038 QuorumPeerMain
[whx@hadoop102 ~]$
Start HDFS:
[whx@hadoop102 ~]$ start-dfs.sh
Starting namenodes on [hadoop101]
hadoop101: starting namenode, logging to /opt/module/hadoop-2.7.2/logs/hadoop-whx-namenode-hadoop101.out
hadoop102: starting datanode, logging to /opt/module/hadoop-2.7.2/logs/hadoop-whx-datanode-hadoop102.out
hadoop101: starting datanode, logging to /opt/module/hadoop-2.7.2/logs/hadoop-whx-datanode-hadoop101.out
hadoop103: starting datanode, logging to /opt/module/hadoop-2.7.2/logs/hadoop-whx-datanode-hadoop103.out
Starting secondary namenodes [hadoop103]
hadoop103: starting secondarynamenode, logging to /opt/module/hadoop-2.7.2/logs/hadoop-whx-secondarynamenode-hadoop103.out
[whx@hadoop102 ~]$ xcall.sh jps
Command to execute: jps
----------------------------hadoop101----------------------------------
3361 Jps
3153 NameNode
3275 DataNode
3037 QuorumPeerMain
----------------------------hadoop102----------------------------------
3633 QuorumPeerMain
3876 DataNode
4106 Jps
----------------------------hadoop103----------------------------------
3281 SecondaryNameNode
3154 DataNode
3331 Jps
3038 QuorumPeerMain
[whx@hadoop102 ~]$
Synchronize the cluster clocks (HBase checks the clock skew between the master and the regionservers):
[whx@hadoop102 ~]$ xcall.sh sudo ntpdate -u ntp4.aliyun.com
Command to execute: sudo ntpdate -u ntp4.aliyun.com
----------------------------hadoop101----------------------------------
2 Feb 11:37:55 ntpdate[5838]: adjust time server 203.107.6.88 offset -0.010311 sec
----------------------------hadoop102----------------------------------
2 Feb 11:37:56 ntpdate[8752]: adjust time server 203.107.6.88 offset -0.000953 sec
----------------------------hadoop103----------------------------------
2 Feb 11:37:56 ntpdate[5930]: adjust time server 203.107.6.88 offset -0.004596 sec
[whx@hadoop102 ~]$
If the nodes cannot be kept in sync, you can instead raise the clock skew the master tolerates, in hbase-site.xml:
<property>
<name>hbase.master.maxclockskew</name>
<value>180000</value>
<description>Time difference of regionserver from master</description>
</property>
Start the master on any single node (start it on only one node, e.g. hadoop102):
[whx@hadoop102 ~]$ /opt/module/hbase/bin/hbase-daemon.sh start master
starting master, logging to /opt/module/hbase/bin/../logs/hbase-whx-master-hadoop102.out
[whx@hadoop102 ~]$ xcall.sh jps
Command to execute: jps
----------------------------hadoop101----------------------------------
3153 NameNode
3275 DataNode
3037 QuorumPeerMain
3670 Jps
----------------------------hadoop102----------------------------------
3876 DataNode
4426 HMaster
3633 QuorumPeerMain
4641 Jps
----------------------------hadoop103----------------------------------
3281 SecondaryNameNode
3154 DataNode
3038 QuorumPeerMain
3654 Jps
[whx@hadoop102 ~]$
Stop the master on hadoop102:
[whx@hadoop102 ~]$ /opt/module/hbase/bin/hbase-daemon.sh stop master
stopping master.
[whx@hadoop102 ~]$
Start the regionserver on every server:
[whx@hadoop102 ~]$ xcall.sh /opt/module/hbase/bin/hbase-daemon.sh start regionserver
Command to execute: /opt/module/hbase/bin/hbase-daemon.sh start regionserver
----------------------------hadoop101----------------------------------
starting regionserver, logging to /opt/module/hbase/bin/../logs/hbase-whx-regionserver-hadoop101.out
----------------------------hadoop102----------------------------------
starting regionserver, logging to /opt/module/hbase/bin/../logs/hbase-whx-regionserver-hadoop102.out
----------------------------hadoop103----------------------------------
starting regionserver, logging to /opt/module/hbase/bin/../logs/hbase-whx-regionserver-hadoop103.out
[whx@hadoop102 ~]$ xcall.sh jps
Command to execute: jps
----------------------------hadoop101----------------------------------
3153 NameNode
3275 DataNode
3037 QuorumPeerMain
3447 HRegionServer
3610 Jps
----------------------------hadoop102----------------------------------
3876 DataNode
4426 HMaster
3633 QuorumPeerMain
4196 HRegionServer
4363 Jps
----------------------------hadoop103----------------------------------
3281 SecondaryNameNode
3154 DataNode
3038 QuorumPeerMain
3425 HRegionServer
3592 Jps
[whx@hadoop102 ~]$
Stop the regionserver on every server:
[whx@hadoop102 ~]$ xcall.sh /opt/module/hbase/bin/hbase-daemon.sh stop regionserver
Command to execute: /opt/module/hbase/bin/hbase-daemon.sh stop regionserver
----------------------------hadoop101----------------------------------
stopping regionserver.
----------------------------hadoop102----------------------------------
stopping regionserver.
----------------------------hadoop103----------------------------------
stopping regionserver.
[whx@hadoop102 ~]$
Port note:
Open hadoop102:16010 in a web browser to view the HMaster information on hadoop102.
Prerequisite for the hbase-daemons.sh, start-hbase.sh, and stop-hbase.sh commands: conf/regionservers must list every regionserver host:
hadoop101
hadoop102
hadoop103
Start the regionservers on all servers at once:
[whx@hadoop102 ~]$ /opt/module/hbase/bin/hbase-daemons.sh start regionserver
hadoop102: starting regionserver, logging to /opt/module/hbase/bin/../logs/hbase-whx-regionserver-hadoop102.out
hadoop101: starting regionserver, logging to /opt/module/hbase/bin/../logs/hbase-whx-regionserver-hadoop101.out
hadoop103: starting regionserver, logging to /opt/module/hbase/bin/../logs/hbase-whx-regionserver-hadoop103.out
[whx@hadoop102 ~]$
Stop the regionservers on all servers at once:
[whx@hadoop102 ~]$ /opt/module/hbase/bin/hbase-daemons.sh stop regionserver
hadoop102: stopping regionserver.....
hadoop103: stopping regionserver.....
hadoop101: stopping regionserver........
[whx@hadoop102 ~]$
Start the whole cluster (master plus all regionservers) with a single command:
[whx@hadoop102 ~]$ /opt/module/hbase/bin/start-hbase.sh
starting master, logging to /opt/module/hbase/bin/../logs/hbase-whx-master-hadoop102.out
hadoop103: starting regionserver, logging to /opt/module/hbase/bin/../logs/hbase-whx-regionserver-hadoop103.out
hadoop101: starting regionserver, logging to /opt/module/hbase/bin/../logs/hbase-whx-regionserver-hadoop101.out
hadoop102: starting regionserver, logging to /opt/module/hbase/bin/../logs/hbase-whx-regionserver-hadoop102.out
[whx@hadoop102 ~]$ xcall.sh jps
Stop the whole cluster:
[whx@hadoop102 ~]$ /opt/module/hbase/bin/stop-hbase.sh
stopping hbase..................
[whx@hadoop102 ~]$
First create a test table, whx_table, in HBase:
hbase(main):019:0> create 'whx_table','cf_user','cf_company'
0 row(s) in 1.2170 seconds
=> Hbase::Table - whx_table
hbase(main):020:0> desc 'whx_table'
Table whx_table is ENABLED
whx_table
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf_company', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'cf_user', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
2 row(s) in 0.0100 seconds
hbase(main):021:0>
Insert data into whx_table and read it back:
hbase(main):027:0> put 'whx_table','1001','cf_user:firstname','Nick'
0 row(s) in 0.0470 seconds
hbase(main):029:0> put 'whx_table','1001','cf_user:lastname','Lee'
0 row(s) in 0.0150 seconds
hbase(main):030:0> put 'whx_table','1001','cf_company:name','HUAWEI'
0 row(s) in 0.0140 seconds
hbase(main):031:0> put 'whx_table','1001','cf_company:address','changanjie10hao'
0 row(s) in 0.0080 seconds
hbase(main):033:0> get 'whx_table','1001'
COLUMN               CELL
 cf_company:address  timestamp=1612408142513, value=changanjie10hao
 cf_company:name     timestamp=1612408141461, value=HUAWEI
 cf_user:firstname   timestamp=1612408054676, value=Nick
 cf_user:lastname    timestamp=1612408141421, value=Lee
1 row(s) in 0.0200 seconds
hbase(main):034:0>
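For reference, the same put/get can be issued from the Java client. A minimal sketch against the HBase 1.x API, not part of the original walkthrough (it assumes hbase-site.xml is on the classpath, and the class name is illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WhxTableDemo {
    public static void main(String[] args) throws IOException {
        // Reads hbase.zookeeper.quorum etc. from hbase-site.xml on the classpath
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("whx_table"))) {
            // Equivalent of: put 'whx_table','1001','cf_user:firstname','Nick'
            Put put = new Put(Bytes.toBytes("1001"));
            put.addColumn(Bytes.toBytes("cf_user"), Bytes.toBytes("firstname"), Bytes.toBytes("Nick"));
            table.put(put);
            // Equivalent of: get 'whx_table','1001'
            Result result = table.get(new Get(Bytes.toBytes("1001")));
            System.out.println(result);
        }
    }
}
```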
First make sure /etc/profile contains the Hive and HBase environment variables:
HIVE_HOME=/opt/module/hive
HBASE_HOME=/opt/module/hbase
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$FLUME_HOME/bin:$HBASE_HOME/bin
export JAVA_HOME HADOOP_HOME HIVE_HOME FLUME_HOME HBASE_HOME PATH
Because later operations in Hive may also affect HBase, Hive must hold the jars needed to operate on HBase. Copy the jars Hive depends on (or use symlinks instead). Run the following on hadoop102 to create the symlinks:
ln -s $HBASE_HOME/lib/hbase-common-1.3.1.jar $HIVE_HOME/lib/hbase-common-1.3.1.jar
ln -s $HBASE_HOME/lib/hbase-server-1.3.1.jar $HIVE_HOME/lib/hbase-server-1.3.1.jar
ln -s $HBASE_HOME/lib/hbase-client-1.3.1.jar $HIVE_HOME/lib/hbase-client-1.3.1.jar
ln -s $HBASE_HOME/lib/hbase-protocol-1.3.1.jar $HIVE_HOME/lib/hbase-protocol-1.3.1.jar
ln -s $HBASE_HOME/lib/hbase-it-1.3.1.jar $HIVE_HOME/lib/hbase-it-1.3.1.jar
ln -s $HBASE_HOME/lib/htrace-core-3.1.0-incubating.jar $HIVE_HOME/lib/htrace-core-3.1.0-incubating.jar
ln -s $HBASE_HOME/lib/hbase-hadoop2-compat-1.3.1.jar $HIVE_HOME/lib/hbase-hadoop2-compat-1.3.1.jar
ln -s $HBASE_HOME/lib/hbase-hadoop-compat-1.3.1.jar $HIVE_HOME/lib/hbase-hadoop-compat-1.3.1.jar
The commands above create symlinks in $HIVE_HOME/lib pointing at the source jars in $HBASE_HOME/lib; symlinking instead of copying avoids duplicating jars on the server.
When Hive reads data from HBase it needs Zookeeper, so add the Zookeeper properties to hive-site.xml:
<property>
<name>hive.zookeeper.quorum</name>
<value>hadoop101,hadoop102,hadoop103</value>
<description>The list of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
</property>
<property>
<name>hive.zookeeper.client.port</name>
<value>2181</value>
<description>The port of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
</property>
The resulting hive-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop102:3306/metastore?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description>password to use against metastore database</description>
  </property>
  <!-- Custom location of the Hive warehouse in HDFS; defaults to /user/hive/warehouse -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
  </property>
  <!-- Show the current database and column headers in query output -->
  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
  </property>
  <!-- Zookeeper is needed when Hive reads data from HBase -->
  <property>
    <name>hive.zookeeper.quorum</name>
    <value>hadoop101,hadoop102,hadoop103</value>
    <description>The list of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
  </property>
  <property>
    <name>hive.zookeeper.client.port</name>
    <value>2181</value>
    <description>The port of ZooKeeper servers to talk to. This is only needed for read/write locks.</description>
  </property>
</configuration>
Case 1: the data is already in HBase; just create a mapping table in Hive and query it.
create external table hbase_t3(
id int,
age int,
gender string,
name string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:age,info:gender,info:name")
TBLPROPERTIES ("hbase.table.name" = "t3");
external means we are creating an external table: the data is in HBase, not in Hive, so Hive can only map it as an external table;
Storage handlers are a Hive extension module that lets Hive analyze data that is not stored in HDFS.
SERDE: SerDe is short for Serializer/Deserializer. When the MapReduce job runs, the SerDe deserializes data read from the input directory into the key-value objects handed to the Mapper, and serializes the key-value objects written by the Reducer before storing them in the output directory. Different input or output data formats need different SerDes. For plain files, and whenever no SerDe is specified at table creation, LazySimpleSerDe is used by default.
For example, suppose the data is entirely JSON:
{"name":"songsong","friends":["bingbing","lili"]}
{"name":"songsong1","friends": ["bingbing1" , "lili1"]}
Wrong approach:
create table testSerde(
name string,
friends array<string>
)
ROW FORMAT DELIMITED fields terminated by ','
collection items terminated by ','
lines terminated by '\n';
With row format delimited, LazySimpleSerDe is used by default, and LazySimpleSerDe can only handle delimited plain text. Since the data here is JSON ({...}), a JsonSerDe must be used:
create table testSerde2(
name string,
friends array<string>
)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:age,info:gender,info:name") declares the one-to-one mapping between Hive columns and HBase columns:
| Hive column | HBase column | Notes |
|---|---|---|
| id | :key | :key is the HBase rowkey, i.e. id serves as the row key |
| age | info:age | the age column in the info column family |
| gender | info:gender | the gender column in the info column family |
| name | info:name | the name column in the info column family |
TBLPROPERTIES ("hbase.table.name" = "t3") says the Hive table hbase_t3 maps to the HBase table t3.
Case 2: the data is not yet in HBase. Create the table in Hive first; after creation, run an import in Hive to load the data into HBase, then analyze it.
Create the table in Hive:
CREATE TABLE `hbase_emp`(
`empno` int,
`ename` string,
`job` string,
`mgr` int,
`hiredate` string,
`sal` double,
`comm` double,
`deptno` int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:ename,info:job,info:mgr,
info:hiredate,info:sal,info:comm,info:deptno")
TBLPROPERTIES ("hbase.table.name" = "emp");
| Hive column | HBase column | Notes |
|---|---|---|
| empno | :key | :key is the HBase rowkey, i.e. empno serves as the row key |
| ename | info:ename | the ename column in the info column family |
| job | info:job | the job column in the info column family |
| mgr | info:mgr | the mgr column in the info column family |
| hiredate | info:hiredate | the hiredate column in the info column family |
| sal | info:sal | the sal column in the info column family |
| comm | info:comm | the comm column in the info column family |
| deptno | info:deptno | the deptno column in the info column family |
Import data into the Hive table with insert; the inserted rows are then stored in HBase:
insert into table hbase_emp select * from emp;
You cannot use load to import data into a Hive table backed by HBase: load merely moves files into the warehouse directory (an HDFS put), whereas insert runs a MapReduce job that writes the rows into HBase.
After a successful import:
Version compatibility between HBase and Hive:

| HBase version | Hive version | Hadoop version |
|---|---|---|
| 0.94 | 1.2.1 | ** |
| 1.3.1 | 2.x | 2.5.2 |

If your Hive version is 1.2.1 while HBase is 1.3.1, the two do not match.
In that case, recompile hive-hbase-handler-1.2.1.jar from the Hive 1.2.1 source (only that module is incompatible).
In HBase, the HMaster monitors the lifecycle of the RegionServers and balances their load. If the HMaster goes down, the whole HBase cluster falls into an unhealthy state, and that working state cannot be sustained for long. HBase therefore supports a high-availability configuration for the HMaster.
[whx@hadoop102 hbase]$ bin/stop-hbase.sh
Create conf/backup-masters and put the hostnames of the backup masters in it:
[whx@hadoop102 hbase]$ touch conf/backup-masters
hadoop101
hadoop102
hadoop103
[whx@hadoop102 conf]$ xsync.sh backup-masters
[whx@hadoop102 conf]$ /opt/module/hbase/bin/start-hbase.sh
starting master, logging to /opt/module/hbase/logs/hbase-whx-master-hadoop102.out
hadoop102: starting regionserver, logging to /opt/module/hbase/logs/hbase-whx-regionserver-hadoop102.out
hadoop101: starting regionserver, logging to /opt/module/hbase/bin/../logs/hbase-whx-regionserver-hadoop101.out
hadoop103: starting regionserver, logging to /opt/module/hbase/bin/../logs/hbase-whx-regionserver-hadoop103.out
hadoop103: starting master, logging to /opt/module/hbase/bin/../logs/hbase-whx-master-hadoop103.out
hadoop101: starting master, logging to /opt/module/hbase/bin/../logs/hbase-whx-master-hadoop101.out
hadoop102: master running as process 8814. Stop it first.
[whx@hadoop102 conf]$ xcall.sh jps
Command to execute: jps
----------------------------hadoop101----------------------------------
6032 HMaster
5621 DataNode
6213 Jps
5501 NameNode
5773 QuorumPeerMain
5903 HRegionServer
----------------------------hadoop102----------------------------------
9304 Jps
8984 HRegionServer
8601 QuorumPeerMain
8301 DataNode
8814 HMaster
----------------------------hadoop103----------------------------------
6210 SecondaryNameNode
6085 DataNode
6328 QuorumPeerMain
6841 Jps
6462 HRegionServer
6591 HMaster
[whx@hadoop102 conf]$
To test failover, kill the active HMaster process on hadoop102 (pid 8814 in the jps output above); one of the backup masters then takes over:
[whx@hadoop102 conf]$ kill -9 8814
[whx@hadoop102 conf]$
Normally a newly created table has only a single region; as that region accumulates data, it splits automatically. To spread load from the start, you can pre-split regions when creating the table:
create 'staff1','info','partition1',SPLITS => ['1000','2000','3000','4000']
create 'staff2','info','partition2',{NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
Or create a splits.txt file with the following content:
aaaa
bbbb
cccc
dddd
Then run:
create 'staff3','partition3',SPLITS_FILE => 'splits.txt'
// Pre-splitting from the Java client API (HBase 1.x)
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Custom algorithm: generate a series of hash split points into a 2-D byte array
byte[][] splitKeys = ...; // produced by some hash function over sampled rowkeys
// Create an HBaseAdmin instance
HBaseAdmin hAdmin = new HBaseAdmin(HBaseConfiguration.create());
// Create an HTableDescriptor instance (a table needs at least one column family)
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf(tableName));
tableDesc.addFamily(new HColumnDescriptor("info"));
// Create the pre-split HBase table from the descriptor and the split-key array
hAdmin.createTable(tableDesc, splitKeys);
A row's unique identifier is its rowkey; which region stores the row depends on which pre-split interval the rowkey falls into.
The main goal of rowkey design is to spread data evenly across all regions and, to some extent, prevent data skew. Hashing is one approach:
The rowkey 1001 becomes dd01903921ea24941c26a48f2cec24e0bb0e8cc7 after SHA1
The rowkey 3001 becomes 49042c54de64a1e9bf0b33e00245660ef92dc7bd after SHA1
The rowkey 5001 becomes 7b61dec07e02c188790670af43e717f0f46e8913 after SHA1
Before doing this, we usually sample the dataset to decide which rowkeys, after hashing, should serve as the boundary values of each region.
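A sketch of the hashing step using the plain JDK MessageDigest (the class and method names are illustrative, not from the original):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowkeyHash {
    // SHA-1 a raw rowkey and render it as a lowercase hex string
    static String sha1(String rawKey) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(rawKey.getBytes(StandardCharsets.UTF_8))) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Should reproduce digests like the ones quoted above,
        // e.g. "1001" -> dd01903921ea24941c26a48f2cec24e0bb0e8cc7
        System.out.println(sha1("1001"));
        System.out.println(sha1("3001"));
        System.out.println(sha1("5001"));
    }
}
```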
Reversing the timestamp:
20170524000001 becomes 10000042507102
20170524000002 becomes 20000042507102
This, too, scatters to some degree the data being put in with steadily increasing keys.
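The reversal itself is a one-liner; a minimal sketch:

```java
// Reverse a monotonically increasing timestamp so consecutive keys
// no longer share a common prefix: 20170524000001 -> 10000042507102
String ts = "20170524000001";
String reversedKey = new StringBuilder(ts).reverse().toString();
System.out.println(reversedKey); // prints 10000042507102
```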
20170524000001_a12e
20170524000001_93i7
Designing rowkeys by string concatenation; the principles:
Example: a money-transfer scenario

| Serial number | Payee account | Payer account | Time | Amount | User |
|---|---|---|---|---|---|
| *** | *** | *** | *** | *** | *** |

The serial number is well suited as the rowkey; concatenate further strings onto it to form the complete rowkey.
Format:
If the serial number is already well scattered by design, put it first and append a random value.
If it is not scattered enough, compute its hash, or append a value that is scattered.
Example: how to make one month's data land in the same region. Take the month from the record's time as the input, hash it, and prepend the resulting string to the front of the rowkey, as in the sketch below.
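A hedged sketch of that idea (the bucket count, field values, and format are assumptions for illustration, not from the original):

```java
// Derive a salt from the month so that every row of one month maps to
// the same pre-split bucket, then prepend it to the serial number.
String month = "201705";                        // month taken from the record's time
String serialNo = "20170524000001";             // the transaction's serial number
int buckets = 15;                               // assumed number of pre-split regions
int salt = (month.hashCode() & Integer.MAX_VALUE) % buckets;
String rowkey = String.format("%02d_%s", salt, serialNo);
System.out.println(rowkey);                     // e.g. "07_20170524000001"
```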
HBase operations need a lot of memory, since tables can be cached in memory; typically about 70% of the machine's available memory is given to HBase's Java heap. Do not allocate an extremely large heap, though: long GC pauses would leave the regionserver unavailable for extended stretches, and 16-48 GB is usually enough. If the framework's memory usage exhausts system memory, the OS will drag the framework down with it anyway.
Set the JVM options for the regionserver process in /opt/module/hbase/conf/hbase-env.sh.
If dfs.support.append is set to false here, none of HBase's features will work, so it must be set to true.
When a MemStore reaches its threshold, its contents are flushed into a StoreFile; the compact mechanism merges the small flushed files into larger StoreFiles; and a split cuts an oversized region in two once the region reaches its threshold.
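These three housekeeping steps normally fire automatically on their thresholds, but the 1.x Admin API can also trigger them by hand. A minimal sketch (table name and split point are illustrative):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class MaintenanceDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName t = TableName.valueOf("whx_table");
            admin.flush(t);                        // flush memstores to storefiles
            admin.majorCompact(t);                 // merge small storefiles into large ones
            admin.split(t, Bytes.toBytes("1001")); // split the region at a given row
        }
    }
}
```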
Bloom filters in HBase are configured per column family, and two types are supported: ROW and ROWCOL.
Note: in older versions only get operations used the bloom filter; scan did not. Since 1.x, scan can also make some use of it.
Enabling a bloom filter costs extra memory; the filters usually live in the blockcache and the memstore.
A ROW bloom filter uses each row's rowkey as the lookup key. Example with column families info1 and info2:
When querying r1, a filter hit means storefile2 definitely does not contain r1, while storefile1 may contain it.
A ROWCOL bloom filter uses each row's rowkey together with the column as the lookup key. Example:
When querying rowkey=r1 and only the column info1:age=20, a filter hit means storefile2 definitely does not contain that cell, while storefile1 may.
When scanning the memstores of all column families in r1's region, the bloom filter is consulted first: if it says r1 is absent, the memstore is skipped; if r1 may be present, the memstore is scanned.
When scanning StoreFiles, if r1's block is already cached in the blockcache, the blockcache is scanned directly; again, the bloom filter is checked first, and the scan is skipped when r1 is definitely absent.
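Since the filter type is set per column family, it can be chosen at table-creation time. A minimal sketch with the 1.x client (table and family names are illustrative):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class BloomDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("t_bloom"));
            HColumnDescriptor cf = new HColumnDescriptor("info");
            cf.setBloomFilterType(BloomType.ROWCOL); // the default is ROW
            desc.addFamily(cf);
            admin.createTable(desc);
        }
    }
}
```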
Message volume: more than 6 billion messages sent and received
Nearly 100 billion read and write operations on the data
Around 1.5 million operations per second at peak
Reads account for roughly 55% of the workload overall, writes for 45%
Over 2 PB of data, about 6 PB including replication
Data grows by roughly 300 gigabytes per month