Web UI address: http://h71:60010
Note: the IP for h71 is configured in $HBASE_HOME/conf/hbase-site.xml.
hbase.master.info.port
Port for the HBase Master web UI. Set it to -1 if you do not want the UI to run.
Default: 60010
Note: newer versions changed this to 16010, so the address becomes http://h71:16010.
hbase.master.info.bindAddress
The IP address the HBase Master web UI binds to.
Default: 0.0.0.0
Mapping IPs to hostnames:
Linux: configure in /etc/hosts
Windows: configure in the hosts file under C:\Windows\System32\drivers\etc
192.168.8.71 h71
192.168.8.72 h72
192.168.8.73 h73
Note: if Kerberos authentication is enabled, authenticate first with the corresponding keytab (using the kinit command). Once authentication succeeds, enter the hbase shell; the whoami command shows the current user.
$HBASE_HOME/bin/hbase shell
hbase(main):029:0> whoami
hadoop (auth:SIMPLE)
groups: hadoop
hbase(main):008:0> version
1.0.0-cdh5.5.2, rUnknown, Mon Jan 25 16:33:02 PST 2016
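If you connect from Java code against a Kerberos-enabled cluster, the equivalent of kinit is a keytab login through UserGroupInformation before the connection is created. A minimal sketch, assuming a made-up principal and keytab path (adjust to your environment):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
    public static Connection connect() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Tell the Hadoop security layer to use Kerberos
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Log in with the keytab (placeholder principal and path)
        UserGroupInformation.loginUserFromKeytab("hadoop@EXAMPLE.COM", "/etc/security/keytabs/hadoop.keytab");
        // All subsequent HBase calls run as the logged-in user
        return ConnectionFactory.createConnection(conf);
    }
}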
list      # list all tables in the database
status    # show the status of the running cluster
exists 'table_name'   # check whether a table exists
As in relational database systems, a namespace is a logical grouping of tables, where tables in the same group serve similar purposes. The namespace concept lays the groundwork for the upcoming multi-tenancy features:
Namespaces can be created, removed, and altered. A table's namespace membership is fixed when the table is created, using the format <namespace>:<table>.
Example: creating a namespace, creating a table inside it, removing it, and altering it in the hbase shell:
# Create a namespace
create_namespace 'my_ns'
# Create my_table in the my_ns namespace
create 'my_ns:my_table', 'fam'
# Drop a namespace
drop_namespace 'my_ns'
Note: a namespace can only be dropped when it no longer contains any tables; if tables exist, drop them first:
hbase(main):005:0> disable 'my_ns:my_table'
hbase(main):006:0> drop 'my_ns:my_table'
# Alter a namespace
alter_namespace 'my_ns', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'}
# List all namespaces
list_namespace
# Describe a namespace
hbase(main):005:0> describe_namespace 'hbase'
DESCRIPTION
{NAME => 'hbase'}
Took 0.0206 seconds
=> 1
There are two predefined, built-in namespaces (hbase for system tables and default for tables created without a namespace):
# Specify a namespace: namespace=my_ns and table qualifier=bar
create 'my_ns:bar', 'fam'
# Use the default namespace: namespace=default and table qualifier=bar
create 'bar', 'fam'
Syntax: create <table>, {NAME => <family>, VERSIONS => <versions>}
Examples:
create 't1', {NAME => 'f1', VERSIONS => 5}
create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
Shorthand for creating column families: create 't1', 'f1', 'f2', 'f3'
Specifying parameters for each column family:
create 't1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true}
create 't1', 'f1', {SPLITS => ['10', '20', '30', '40']}
Setting various parameters to improve the table's read performance:
create 'lmj_test',
{NAME => 'adn', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROWCOL', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '15768000', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', ENCODE_ON_DISK => 'true', IN_MEMORY => 'false', BLOCKCACHE => 'false'},
{NAME => 'fixeddim', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROWCOL', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '15768000', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', ENCODE_ON_DISK => 'true', IN_MEMORY => 'false', BLOCKCACHE => 'false'},
{NAME => 'social', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROWCOL', REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', TTL => '15768000', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', ENCODE_ON_DISK => 'true', IN_MEMORY => 'false', BLOCKCACHE => 'false'}
Each of these attributes has performance implications; reasonable settings can improve table performance:
create 'lmj_test',
{NAME => 'adn', BLOOMFILTER => 'ROWCOL', VERSIONS => '1', TTL => '15768000', MIN_VERSIONS => '0', COMPRESSION => 'SNAPPY', BLOCKCACHE => 'false'},
{NAME => 'fixeddim',BLOOMFILTER => 'ROWCOL', VERSIONS => '1', TTL => '15768000', MIN_VERSIONS => '0', COMPRESSION => 'SNAPPY', BLOCKCACHE => 'false'},
{NAME => 'social',BLOOMFILTER => 'ROWCOL', VERSIONS => '1', TTL => '15768000', MIN_VERSIONS => '0',COMPRESSION => 'SNAPPY', BLOCKCACHE => 'false'}
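The same kind of table creation can also be done from the Java client. Below is a minimal sketch using the HBase 2.x Admin API; the family attributes and pre-split boundaries mirror the shell examples above, while the table and family names are only placeholders:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // One column family: single version, TTL in seconds, Snappy compression, block cache off
            ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("f1"))
                    .setMaxVersions(1)
                    .setTimeToLive(15768000)
                    .setCompressionType(Compression.Algorithm.SNAPPY)
                    .setBlockCacheEnabled(false)
                    .build();
            TableDescriptor table = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("t1"))
                    .setColumnFamily(cf)
                    .build();
            // Pre-split on the same boundaries as the SPLITS shell example
            byte[][] splits = {Bytes.toBytes("10"), Bytes.toBytes("20"), Bytes.toBytes("30"), Bytes.toBytes("40")};
            admin.createTable(table, splits);
        }
    }
}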
In HBase, the storage unit addressed by a row and a column is called a cell, and each cell keeps multiple versions of the same piece of data. By default HBase stores three versions of history (note that newer releases default to one). In practice, for performance or business reasons, we may only need one version, or some other number. How do we change this default?
If you have not created the table yet, you can set how many versions to store by specifying VERSIONS at creation time:
create 'table_name', {NAME => 'cf1', VERSIONS => <n>}, {NAME => 'cf2', VERSIONS => <n>}
If the version count was not specified at creation time, modify the table configuration as follows.
With the table already created, first take it offline:
disable 'table'
Modify the table attributes (a specific column family can be targeted):
alter 'table' , NAME => 'f', VERSIONS => 1
Bring the table back online (enable):
enable 'table'
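The equivalent alter can be issued from the Java client as well. A sketch assuming the HBase 2.x Admin API and the same table/family placeholders as above; it mirrors the disable/alter/enable flow:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class AlterVersionsExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableName tn = TableName.valueOf("table");
            admin.disableTable(tn);
            // Rebuild the 'f' family descriptor with VERSIONS => 1
            ColumnFamilyDescriptor current = admin.getDescriptor(tn).getColumnFamily(Bytes.toBytes("f"));
            ColumnFamilyDescriptor updated = ColumnFamilyDescriptorBuilder.newBuilder(current)
                    .setMaxVersions(1)
                    .build();
            admin.modifyColumnFamily(tn, updated);
            admin.enableTable(tn);
        }
    }
}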
Based on the version setting we can query a given number of versions; the table below currently has VERSIONS set to 10:
hbase(main):060:0> scan 'test_schema1:t2'
ROW          COLUMN+CELL
0 row(s)
Took 0.1183 seconds
hbase(main):061:0> put 'test_schema1:t2','101','F:b','huiqtest1'
Took 0.0050 seconds
hbase(main):062:0> put 'test_schema1:t2','101','F:b','huiqtest2'
Took 0.0046 seconds
hbase(main):063:0> put 'test_schema1:t2','101','F:b','huiqtest3'
Took 0.0157 seconds
hbase(main):064:0> scan 'test_schema1:t2'
ROW          COLUMN+CELL
 101         column=F:b, timestamp=1627353050875, value=huiqtest3
1 row(s)
Took 0.0048 seconds
hbase(main):065:0> scan 'test_schema1:t2', {VERSIONS=>3}
ROW          COLUMN+CELL
 101         column=F:b, timestamp=1627353050875, value=huiqtest3
 101         column=F:b, timestamp=1627353048782, value=huiqtest2
 101         column=F:b, timestamp=1627353045389, value=huiqtest1
1 row(s)
Took 0.0097 seconds
hbase(main):066:0> scan 'test_schema1:t2', {COLUMNS => ['F:a', 'F:b'], VERSIONS=>3}
ROW          COLUMN+CELL
 101         column=F:b, timestamp=1627353050875, value=huiqtest3
 101         column=F:b, timestamp=1627353048782, value=huiqtest2
 101         column=F:b, timestamp=1627353045389, value=huiqtest1
1 row(s)
Took 0.0088 seconds
hbase(main):068:0> get 'test_schema1:t2','101','F:b'
COLUMN       CELL
 F:b         timestamp=1627353050875, value=huiqtest3
1 row(s)
Took 0.0154 seconds
hbase(main):069:0> get 'test_schema1:t2','101', {COLUMNS => ['F:b'], VERSIONS=>3}
COLUMN       CELL
 F:b         timestamp=1627353050875, value=huiqtest3
 F:b         timestamp=1627353048782, value=huiqtest2
 F:b         timestamp=1627353045389, value=huiqtest1
1 row(s)
Took 0.0163 seconds
hbase(main):070:0> get 'test_schema1:t2','101', {COLUMNS => 'F:b', VERSIONS=>3}
COLUMN       CELL
 F:b         timestamp=1627353050875, value=huiqtest3
 F:b         timestamp=1627353048782, value=huiqtest2
 F:b         timestamp=1627353045389, value=huiqtest1
1 row(s)
Took 0.0044 seconds
hbase(main):073:0> put 'test_schema1:t2','101','F:a','101'
Took 0.0660 seconds
hbase(main):077:0> get 'test_schema1:t2','101', {COLUMNS => ['F:a', 'F:b'], VERSIONS=>2}
COLUMN       CELL
 F:a         timestamp=1627353603902, value=101
 F:b         timestamp=1627353050875, value=huiqtest3
 F:b         timestamp=1627353048782, value=huiqtest2
1 row(s)
Took 0.0053 seconds
# Delete a specific version
hbase(main):078:0> delete 'test_schema1:t2','101','F:b',1627353048782
Took 0.0136 seconds
hbase(main):079:0> get 'test_schema1:t2','101', {COLUMNS => ['F:a', 'F:b'], VERSIONS=>2}
COLUMN       CELL
 F:a         timestamp=1627353603902, value=101
 F:b         timestamp=1627353050875, value=huiqtest3
 F:b         timestamp=1627353045389, value=huiqtest1
1 row(s)
Took 0.0115 seconds
hbase(main):180:0> put 'scores','zhangsan01','course:math','99'
hbase(main):181:0> put 'scores','zhangsan01','course:art','90'
hbase(main):182:0> put 'scores','zhangsan01','grade:','101'
hbase(main):184:0> put 'scores','zhangsan02','course:math','66'
hbase(main):185:0> put 'scores','zhangsan02','course:art','60'
hbase(main):186:0> put 'scores','zhangsan02','grade:','102'
hbase(main):201:0> put 'scores','lisi01','course:math','89'
hbase(main):202:0> put 'scores','lisi01','course:art','89'
hbase(main):203:0> put 'scores','lisi01','grade:','201'
Updating data works the same way as inserting it; both use the put command:
# Syntax:
put 'tablename','row','colfamily:colname','newvalue'
# Update row '1' of the emp table, changing column 'personal data:city' to 'bj'
put 'emp','1','personal data:city','bj'
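From the Java client the same operation uses the Put class; a put that targets an existing cell simply writes a new version, so insert and update look identical. A minimal sketch for the emp example above (an existing Connection is assumed):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
    public static void updateCity(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("emp"))) {
            // Row '1', column family 'personal data', qualifier 'city', new value 'bj'
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("city"), Bytes.toBytes("bj"));
            table.put(put);
        }
    }
}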
How do you make a copy of a table inside HBase? Use a snapshot:
Step 1: create a snapshot of the table
hbase(main):204:0> snapshot 'scores' , 'snapshot_scores'
Step 2: clone a new table from the snapshot
hbase(main):205:0> clone_snapshot 'snapshot_scores','scores_2'
If a namespace is involved:
hbase(main):206:0> snapshot 'test_schema1:t1', 'snapshot_t1'
hbase(main):207:0> clone_snapshot 'snapshot_t1','test_schema1:t2'
# List snapshots
hbase(main):208:0> list_snapshots
SNAPSHOT TABLE + CREATION TIME
snapshot_t1 test_schema1:t1 (2021-07-09 16:28:03 +0800)
1 row(s)
Took 0.5443 seconds
=> ["snapshot_t1"]
# Delete a snapshot
hbase(main):209:0> delete_snapshot 'snapshot_t1'
Note: the snapshot commands are not supported before version 0.94.x.
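The snapshot operations are also exposed through the Java Admin API. A small sketch reusing the snapshot and table names from the shell example above (an existing Connection is assumed):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

public class SnapshotExample {
    public static void cloneTable(Connection connection) throws Exception {
        try (Admin admin = connection.getAdmin()) {
            // Take a snapshot of 'scores' and clone it into 'scores_2'
            admin.snapshot("snapshot_scores", TableName.valueOf("scores"));
            admin.cloneSnapshot("snapshot_scores", TableName.valueOf("scores_2"));
            // Remove the snapshot once the clone is no longer needed
            admin.deleteSnapshot("snapshot_scores");
        }
    }
}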
For queries and filters, see my other article: hbase数据查询及过滤器详细使用 (detailed usage of HBase data queries and filters).
Syntax: delete <table>, <rowkey>, <family:column>, <timestamp>
(the column must be specified; its versioned data is deleted)
delete 'scores','zhangsan01','course:math'
Syntax: deleteall <table>, <rowkey>, <family:column>, <timestamp>
deleteall 'scores','zhangsan02'
Note: Put, Delete, Get, and Scan are all classes in the org.apache.hadoop.hbase.client package; see the official API docs for details.
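For reference, here is a small sketch of the Delete class that mirrors the two shell commands above: addColumns removes all versions of one column (like delete with a column given), while a Delete built from just the rowkey removes the whole row (like deleteall). Table and column names follow the scores example; an existing Connection is assumed:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteExample {
    public static void deleteExamples(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("scores"))) {
            // Delete all versions of course:math for row zhangsan01
            Delete deleteColumn = new Delete(Bytes.toBytes("zhangsan01"));
            deleteColumn.addColumns(Bytes.toBytes("course"), Bytes.toBytes("math"));
            table.delete(deleteColumn);

            // Delete the entire row zhangsan02
            Delete deleteRow = new Delete(Bytes.toBytes("zhangsan02"));
            table.delete(deleteRow);
        }
    }
}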
So far no built-in command has been found for this (batch-deleting data by time range); it can only be done with the help of a shell script. Reference: 【hbase】按时间段批量删除hbase数据
touch record.txt
touch delete.sh
echo "scan 'heheda',{STARTROW=>'haha_1649088000',STOPROW=>'haha_1649174399', COLUMNS => 'DATA:qie'}" | hbase shell > record.txt
echo '#!/bin/bash ' >> delete.sh
echo "exec hbase shell <<EOF " >> delete.sh
cat record.txt | awk '{print "deleteall '\'heheda\''", ",", "'\''"$1"'\''"}' >> delete.sh
echo "EOF " >> delete.sh
sh delete.sh
Java API:
/**
 * Batch-delete HBase data by rowKey.
 * Note: the deletes are issued inside each partition, because Connection and Table
 * are not serializable and a driver-side list would not be visible to the executors.
 */
public static void deleteDataBatch(Dataset<Row> dataDataset, Properties properties, String tableName) {
    JavaRDD<Row> dataRDD = dataDataset.toJavaRDD();
    dataRDD.foreachPartition((VoidFunction<Iterator<Row>>) rowIterator -> {
        // Create the HBase connection on the executor
        Connection connection = getHBaseConnect(properties);
        Table table = connection.getTable(TableName.valueOf(tableName));
        List<Delete> deletes = new ArrayList<>();
        while (rowIterator.hasNext()) {
            Row next = rowIterator.next();
            String studentId = next.getAs("student_id");
            String num = next.getAs("num");
            String rowKey = studentId + "_" + num;
            System.out.println("rowKey-->" + rowKey);
            deletes.add(new Delete(Bytes.toBytes(rowKey)));
        }
        try {
            table.delete(deletes);
        } finally {
            table.close();
            connection.close();
        }
    });
}
# Report progress every 100 rows, with a scan cache of 500
count 'scores', {INTERVAL => 100, CACHE => 500}
A hand-rolled Java implementation:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.util.StopWatch;

import java.io.IOException;
import java.util.concurrent.TimeUnit;

public void rowCountByScanFilter(String tablename) {
    long rowCount = 0;
    try {
        // Timer
        StopWatch stopWatch = new StopWatch();
        stopWatch.start();
        TableName name = TableName.valueOf(tablename);
        // connection is a static field of the class
        Table table = connection.getTable(name);
        Scan scan = new Scan();
        // FirstKeyOnlyFilter returns only the first KeyValue of each row, which speeds up the count
        scan.setFilter(new FirstKeyOnlyFilter());
        ResultScanner rs = table.getScanner(scan);
        for (Result result : rs) {
            rowCount += result.size();
        }
        stopWatch.stop();
        System.out.println("RowCount: " + rowCount);
        System.out.println("Elapsed: " + stopWatch.now(TimeUnit.SECONDS) + " s");
    } catch (Throwable e) {
        e.printStackTrace();
    }
}
Note: HBase actually ships a tool for this: hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'table_name'. For a deeper look, see my other article on HBase RowCount statistics (Hbase进行RowCount统计), or other articles such as Hbase查询表大小的4个方式.
truncate 'scores'
Disable the table first, then enable it afterwards.
# Example: change the TTL of the column families of table scores to 180 days
hbase(main):017:0> disable 'scores'
hbase(main):018:0> alter 'scores',{NAME=>'grade',TTL=>'15552000'},{NAME=>'course', TTL=>'15552000'}
Updating all regions with the new schema...
1/1 regions updated.
Done.
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 2.2200 seconds

# Change the number of versions kept:
hbase(main):019:0> alter 'scores',{NAME=>'grade',VERSIONS=>3}
Updating all regions with the new schema...
0/1 regions updated.
1/1 regions updated.
Done.
0 row(s) in 2.4020 seconds

Note: most articles say you must disable before altering the schema and enable afterwards, but I altered directly without doing so and it still succeeded; I am not sure yet whether skipping those steps has any side effects.

# Add a column family:
hbase(main):020:0> alter 'scores', NAME=>'info'
# Delete a column family:
alter 'scores', NAME=> 'info', METHOD => 'delete'
# or
alter 'scores', 'delete' => 'info'
hbase(main):020:0> enable 'scores'
The hbase:meta table records metadata, and its rows carry a timestamp from when they were created. The rowkey is the table name (in namespace:table format); look at the timestamp on the returned data and convert it to a readable time string.
You can also check ZooKeeper: get /hbase/table/<table_name> (again in namespace:table format); the ctime attribute returned is the creation time.
scan 'hbase:meta',{FILTER=>"PrefixFilter('table_name')"}
info:regioninfo
This qualifier contains the STARTKEY and ENDKEY.
info:server
This qualifier contains the region server information.
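The same creation-time lookup can be scripted from Java by scanning hbase:meta with a PrefixFilter on the table name and reading the timestamp of the info:regioninfo cell. A sketch under those assumptions (the table name is just an example):

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;
import java.util.Date;

public class TableCreateTimeExample {
    public static void printCreateTime(Connection connection) throws Exception {
        try (Table meta = connection.getTable(TableName.valueOf("hbase:meta"))) {
            Scan scan = new Scan();
            // Restrict the scan to rows whose key starts with the table name
            scan.setFilter(new PrefixFilter(Bytes.toBytes("test_schema1:t2")));
            try (ResultScanner scanner = meta.getScanner(scan)) {
                for (Result result : scanner) {
                    // The cell timestamp of info:regioninfo is when the region entry was written
                    Cell cell = result.getColumnLatestCell(Bytes.toBytes("info"), Bytes.toBytes("regioninfo"));
                    if (cell != null) {
                        System.out.println(Bytes.toString(result.getRow()) + " -> " + new Date(cell.getTimestamp()));
                    }
                }
            }
        }
    }
}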
hbase(main):044:0> disable 't2'
hbase(main):045:0> drop 't2'
grant 'hadoop','RW','scores'   # grant user hadoop read/write permission on the scores table
Note: at first, granting permissions kept failing with this error:
hbase(main):038:0> grant 'hadoop','RW','scores'
ERROR: DISABLED: Security features are not available
Solution:
[hadoop@h71 ~]$ vi hbase-1.0.0-cdh5.5.2/conf/hbase-site.xml
Add:
<property>
  <name>hbase.superuser</name>
  <value>root,hadoop</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
  <name>hbase.rpc.engine</name>
  <value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value>
</property>
<property>
  <name>hbase.security.authorization</name>
  <value>true</value>
</property>
Sync the HBase configuration (my cluster is h71 (master), h72 and h73 (slaves)):
[hadoop@h71 ~]$ cat /home/hadoop/hbase-1.0.0-cdh5.5.2/conf/regionservers|xargs -i -t scp /home/hadoop/hbase-1.0.0-cdh5.5.2/conf/hbase-site.xml hadoop@{}:/home/hadoop/hbase-1.0.0-cdh5.5.2/conf/hbase-site.xml
scp /home/hadoop/hbase-1.0.0-cdh5.5.2/conf/hbase-site.xml hadoop@h72:/home/hadoop/hbase-1.0.0-cdh5.5.2/conf/hbase-site.xml
hbase-site.xml 100% 2038 2.0KB/s 00:00
scp /home/hadoop/hbase-1.0.0-cdh5.5.2/conf/hbase-site.xml hadoop@h73:/home/hadoop/hbase-1.0.0-cdh5.5.2/conf/hbase-site.xml
hbase-site.xml 100% 2038 2.0KB/s 00:00
Restart the HBase cluster.
Notes:
HBase provides five permission identifiers, RWXCA, corresponding to READ ('R'), WRITE ('W'), EXEC ('X'), CREATE ('C'), and ADMIN ('A').
HBase can enforce access control at several levels (global, namespace, table, column family, and cell).
As in relational databases, permissions are granted and revoked with grant and revoke, though the format differs. grant syntax: grant <user> <permissions> <table> <column_family> <column_qualifier>
hbase(main):010:0> user_permission 'scores'
User Namespace,Table,Family,Qualifier:Permission
hadoop default,scores,,: [Permission: actions=READ,WRITE]
1 row(s) in 0.2530 seconds
hbase(main):006:0> revoke 'hadoop','scores'
Since these are shell commands, you can of course put a series of hbase shell commands into a file and run them in order, just like a Linux shell script. Write all the commands in one file, then run:
[hadoop@h71 hbase-1.0.0-cdh5.5.2]$ vi hehe.txt   (the file name is arbitrary; something like test.hbaseshell is more conventional)
create 'hui','cf'
list
disable 'hui'
drop 'hui'
list
[hadoop@h71 hbase-1.0.0-cdh5.5.2]$ bin/hbase shell hehe.txt
Migrating tables from the source cluster:
Step 1: create the corresponding table on the target cluster.
Step 2: Export phase: scan the source table and write the data as SequenceFiles on HDFS. Export runs as a MapReduce job, so if you use a standalone MR cluster, it just needs the same HBase configuration as the source cluster and network access to it, and the Export command can be used directly. To sync multiple versions, specify the versions parameter; otherwise only the latest version is exported by default. You can also give a start and end time. Usage:
# output_hdfs_path can be an HDFS path on either the target cluster or the source cluster; the version count and start/end time are optional
hbase org.apache.hadoop.hbase.mapreduce.Export <tableName> <output_hdfs_path> <versions> <starttime> <endtime>
# In practice:
[root@node01 ~]# hbase org.apache.hadoop.hbase.mapreduce.Export test_schema1:t2 /huiq 99999
# Note: the /huiq directory must not exist before running this command
Optional parameters:
Step 3: Import phase: load the SequenceFiles exported from the source cluster into the corresponding table on the target cluster. Usage:
# If the exported data lives on the source cluster's HDFS, input_hdfs_path is a source-cluster path; if it lives on the target cluster's HDFS, use the target-cluster path
hbase org.apache.hadoop.hbase.mapreduce.Import <tableName> <input_hdfs_path>
# In practice:
[hdfs@bigdatanode01 ~]$ hbase org.apache.hadoop.hbase.mapreduce.Import test_schema1:t2 hdfs://192.110.110.110:8020/huiq
Note: step 3 may fail with an error.
Solution: switch to the hdfs user (su - hdfs) and run the Import command again.
Conditional export:
From: hbase的 export以及import工具使用示例 + 时间区间+ key前缀
First, the usage of hbase Export:
[root@heheda ~]# hbase org.apache.hadoop.hbase.mapreduce.Export -help
ERROR: Wrong number of arguments: 1
Usage: Export [-D <property=value>]* <tablename> <outputdir> [<versions> [<starttime> [<endtime>]] [^[regex pattern] or [Prefix] to filter]]

  Note: -D properties will be applied to the conf used.
  For example:
   -D mapreduce.output.fileoutputformat.compress=true
   -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
   -D mapreduce.output.fileoutputformat.compress.type=BLOCK
  Additionally, the following SCAN properties can be specified to control/limit what is exported..
   -D hbase.mapreduce.scan.column.family=<family1>,<family2>, ...
   -D hbase.mapreduce.include.deleted.rows=true
   -D hbase.mapreduce.scan.row.start=<ROWSTART>
   -D hbase.mapreduce.scan.row.stop=<ROWSTOP>
   -D hbase.client.scanner.caching=100
   -D hbase.export.visibility.labels=<labels>
For tables with very wide rows consider setting the batch size as below:
   -D hbase.export.scanner.batch=10
   -D hbase.export.scanner.caching=100
   -D mapreduce.job.name=jobName - use the specified mapreduce job name for the export
For MR performance consider the following properties:
   -D mapreduce.map.speculative=false
   -D mapreduce.reduce.speculative=false
The following exports rows from the users table with version=1, start_time=0, end_time=999999999999999, filtering on row keys relative to the prefix row222:
[hdfs@test-hadoop-slave ~]$ hbase org.apache.hadoop.hbase.mapreduce.Export 'users' /test/source/fromhbasetohdfs/users 1 0 999999999999999 '^^(?!row222)'
The Import and Export tools used above form a closed loop within HBase: the files they produce and consume use a special format that only HBase tables can use. But we often need to import ordinary CSV files into HBase. HBase ships a built-in tool for that, ImportTsv: https://hbase.apache.org/book.html#importtsv.options
Usage:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,C1:code,C2:money,C3:xxx,C4:yyy table_name <hdfs_path>
-Dimporttsv.separator specifies the field separator
-Dimporttsv.columns maps the CSV columns, in order, to HBase columns; if a CSV column should be used as the rowkey, put the HBASE_ROW_KEY placeholder in that position
For example:
-Dimporttsv.columns=HBASE_ROW_KEY,C1:code,C2:money,C3:xxx,C4:yyy
means:
- CSV column 1: the rowkey
- CSV column 2: the code qualifier in column family C1
- CSV column 3: the money qualifier in column family C2
...
If, say, the second column should be the rowkey, you can use:
-Dimporttsv.columns=C1:code,HBASE_ROW_KEY,C2:money,C3:xxx,C4:yyy
Note: since CSV files can be imported, you would expect a matching export-to-CSV tool, but none seems to exist (searching the web turned up nothing either; if there is one, please let me know). Once the company needed to export ten-plus days of HBase data as CSV files, and there was no quick, clean way: either query the data and write it into a CSV file by hand, or write code to do it.
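One way to fill that gap is a small client-side program that scans the table and writes the CSV itself. This is only a sketch under assumptions (the table name, output path, and one-cell-per-line layout are made up), not an official tool; values containing commas would need extra quoting:

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import java.io.PrintWriter;

public class ExportToCsv {
    public static void export(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("heheda"));
             PrintWriter out = new PrintWriter("/tmp/heheda.csv", "UTF-8")) {
            out.println("rowkey,family,qualifier,timestamp,value");
            Scan scan = new Scan();
            scan.setCaching(500);  // fetch 500 rows per RPC to speed up the full scan
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    for (Cell cell : result.rawCells()) {
                        // One CSV line per cell
                        out.println(String.join(",",
                                Bytes.toString(CellUtil.cloneRow(cell)),
                                Bytes.toString(CellUtil.cloneFamily(cell)),
                                Bytes.toString(CellUtil.cloneQualifier(cell)),
                                String.valueOf(cell.getTimestamp()),
                                Bytes.toString(CellUtil.cloneValue(cell))));
                    }
                }
            }
        }
    }
}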
Reference: hbase hbck
hbase hbck is a very practical tool that ships with HBase; many problems that come up in HBase can be repaired by trying hbase hbck.
hbck checks and repairs the consistency and integrity of tables and regions. Newer versions of hbck gather Table and Region information from three places -- the HDFS directories, the META table, and the RegionServers -- and use it to diagnose problems and attempt repairs.
Newer hbck versions can repair a variety of errors. The repair options are (note whether each option must be followed by a table name):
(1) -fix: kept for backward compatibility; superseded by -fixAssignments
(2) -fixAssignments: repairs region assignment errors
(3) -fixMeta: repairs problems in the meta table, provided the region info on HDFS exists and is correct
(4) -fixHdfsHoles: repairs region holes (a key interval covered by no region)
(5) -fixHdfsOrphans: repairs orphan regions (regions on HDFS with no .regioninfo file)
(6) -fixHdfsOverlaps: repairs region overlaps (overlapping key ranges)
(7) -fixVersionFile: repairs a missing hbase.version file
(8) -maxMerge <n> (default 5): when overlapping regions need merging, at most this many regions are merged at once
(9) -sidelineBigOverlaps: when fixing overlaps, allows the regions that overlap the most other regions to be sidelined (afterwards the sidelined data can be bulk loaded back into the proper regions)
(10) -maxOverlapsToSideline <n> (default 2): when fixing overlaps, at most this many regions per group may be sidelined
Because there are many options, two shorthand options exist:
(11) -repair: turns on all repair options, equivalent to -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans -fixHdfsOverlaps -fixVersionFile -sidelineBigOverlaps
(12) -repairHoles: equivalent to -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans

Example scenarios:
Q: The hbase.version file is missing
A: Add -fixVersionFile
Q: A region is neither in META nor on HDFS, but is in a regionserver's online region set
A: Add -fixAssignments
Q: A region is in META and in a regionserver's online region set, but not on HDFS
A: Add -fixAssignments -fixMeta (-fixAssignments tells the regionserver to close the region; -fixMeta removes the region's record from META)
Q: A region has no record in META and is not served by any regionserver, but exists on HDFS
A: Add -fixMeta -fixAssignments (-fixAssignments assigns the region; -fixMeta adds the region record to META)
Q: A region has no record in META, exists on HDFS, and is being served by a regionserver
A: Add -fixMeta, which adds the record to META, first undeploying the region and then assigning it. With -fixMeta, if the region is not on HDFS its record is removed from META; if it is on HDFS, the record is added to META
Q: A region has a record in META but is not on HDFS and is not served by any regionserver
A: Add -fixMeta to remove the record from META
Q: A region is in META and on HDFS, the table is not disabled, but the region is not being served
A: Add -fixAssignments to assign the region. -fixAssignments repairs regions that are unassigned, should not be assigned, or are assigned more than once
Q: A region is in META and on HDFS, the table is disabled, but the region is being served by some regionserver
A: Add -fixAssignments to undeploy the region
Q: A region is in META and on HDFS, the table is not disabled, but the region is served by multiple regionservers
A: Add -fixAssignments; all regionservers are told to close the region, then it is assigned
Q: A region is in META and on HDFS and should be served, but the regionserver recorded in META differs from the one actually serving it
A: Add -fixAssignments
Q: Region holes
A: Add -fixHdfsHoles, which creates a new empty region to fill the hole, but neither assigns it nor adds it to META. So after -fixHdfsHoles you still need -fixAssignments -fixMeta (-fixAssignments assigns the region; -fixMeta adds its META record); hence the combined option -repairHoles, equivalent to -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans
Q: A region on HDFS has no .regioninfo file
A: Add -fixHdfsOrphans
Q: Region overlaps
A: Add -fixHdfsOverlaps
Sample output of the command:
[root@node01 spark2]# hbase hbck SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-315/phoenix/phoenix-5.0.0.3.1.4.0-315-server.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/hdp/3.1.4.0-315/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 2021-07-23 09:25:41,838 INFO [main] zookeeper.RecoverableZooKeeper: Process identifier=hbase Fsck connecting to ZooKeeper ensemble=node01:2181,node02:2181,node03:2181 2021-07-23 09:25:41,850 INFO [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-315--1, built on 08/23/2019 05:02 GMT 2021-07-23 09:25:41,850 INFO [main] zookeeper.ZooKeeper: Client environment:host.name=node01 2021-07-23 09:25:41,850 INFO [main] zookeeper.ZooKeeper: Client environment:java.version=1.8.0_231 2021-07-23 09:25:41,850 INFO [main] zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation 2021-07-23 09:25:41,850 INFO [main] zookeeper.ZooKeeper: Client environment:java.home=/opt/tools/jdk1.8.0_231/jre 2021-07-23 09:25:41,851 INFO [main] zookeeper.ZooKeeper: Client environment:java.class.path=/etc/hbase/conf:/opt/tools/jdk1.8.0_231/lib/tools.jar:/usr/hdp/3.1.4.0-315/hbase:/usr/hdp/3.1.4.0-315/hbase/lib/animal-sniffer-annotations-1.17.jar:/usr/hdp/3.1.4.0-315/hbase/lib/aopalliance-1.0.jar:/usr/hdp/3.1.4.0-315/hbase/lib/aopalliance-repackaged-2.5.0-b32.jar:/usr/hdp/3.1.4.0-315/hbase/lib/atlas-plugin-classloader-1.1.0.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/hbase/lib/audience-annotations-0.5.0.jar:/usr/hdp/3.1.4.0-315/hbase/lib/avro-1.7.7.jar:/usr/hdp/3.1.4.0-315/hbase/lib/aws-java-sdk-bundle-1.11.375.jar:/usr/hdp/3.1.4.0-315/hbase/lib/checker-qual-2.8.1.jar:/usr/hdp/3.1.4。。。。。。。(这里依赖太多就省略了) 2021-07-23 09:25:41,882 INFO [main] zookeeper.ZooKeeper: Client environment:java.library.path=:/usr/hdp/3.1.4.0-315/hadoop/lib/native/Linux-amd64-64:/usr/hdp/3.1.4.0-315/hadoop/lib/native/Linux-amd64-64:/usr/hdp/3.1.4.0-315/hadoop/lib/native 2021-07-23 09:25:41,882 INFO [main] zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp 2021-07-23 09:25:41,882 INFO [main] zookeeper.ZooKeeper: Client environment:java.compiler=<NA> 2021-07-23 09:25:41,882 INFO [main] zookeeper.ZooKeeper: Client environment:os.name=Linux 2021-07-23 09:25:41,882 INFO [main] zookeeper.ZooKeeper: Client environment:os.arch=amd64 2021-07-23 09:25:41,882 INFO [main] zookeeper.ZooKeeper: Client environment:os.version=3.10.0-1160.11.1.el7.x86_64 2021-07-23 09:25:41,882 INFO [main] zookeeper.ZooKeeper: Client environment:user.name=root 2021-07-23 09:25:41,882 INFO [main] zookeeper.ZooKeeper: Client environment:user.home=/root 2021-07-23 09:25:41,882 INFO [main] zookeeper.ZooKeeper: Client environment:user.dir=/usr/hdp/3.1.4.0-315/spark2 2021-07-23 09:25:41,885 INFO [main] zookeeper.ZooKeeper: Initiating client connection, connectString=node01:2181,node02:2181,node03:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@604c5de8 HBaseFsck command line options: 2021-07-23 09:25:41,913 INFO [main] util.HBaseFsck: Launching hbck 2021-07-23 09:25:41,916 INFO [main-SendThread(node01:2181)] zookeeper.ClientCnxn: Opening socket connection to server node01/10.3.2.24:2181. 
Will not attempt to authenticate using SASL (unknown error) 2021-07-23 09:25:41,925 INFO [main-SendThread(node01:2181)] zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.3.2.24:58562, server: node01/10.3.2.24:2181 2021-07-23 09:25:41,966 INFO [main-SendThread(node01:2181)] zookeeper.ClientCnxn: Session establishment complete on server node01/10.3.2.24:2181, sessionid = 0x17aadd367a40a27, negotiated timeout = 60000 2021-07-23 09:25:41,995 INFO [main] zookeeper.ReadOnlyZKClient: Connect 0x4a11eb84 to node01:2181,node02:2181,node03:2181 with session timeout=90000ms, retries 6, retry interval 1000ms, keepAlive=60000ms 2021-07-23 09:25:42,000 INFO [ReadOnlyZKClient-node01:2181,node02:2181,node03:2181@0x4a11eb84] zookeeper.ZooKeeper: Initiating client connection, connectString=node01:2181,node02:2181,node03:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$14/781735981@6b9ac39a 2021-07-23 09:25:42,002 INFO [ReadOnlyZKClient-node01:2181,node02:2181,node03:2181@0x4a11eb84-SendThread(node03:2181)] zookeeper.ClientCnxn: Opening socket connection to server node03/10.3.2.26:2181. Will not attempt to authenticate using SASL (unknown error) 2021-07-23 09:25:42,004 INFO [ReadOnlyZKClient-node01:2181,node02:2181,node03:2181@0x4a11eb84-SendThread(node03:2181)] zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.3.2.24:49356, server: node03/10.3.2.26:2181 2021-07-23 09:25:42,051 INFO [ReadOnlyZKClient-node01:2181,node02:2181,node03:2181@0x4a11eb84-SendThread(node03:2181)] zookeeper.ClientCnxn: Session establishment complete on server node03/10.3.2.26:2181, sessionid = 0x37aad46c7b71234, negotiated timeout = 60000 Version: 2.0.2.3.1.4.0-315 2021-07-23 09:25:42,797 INFO [main] util.HBaseFsck: Computing mapping of all store files . 2021-07-23 09:25:43,501 INFO [main] util.HBaseFsck: Validating mapping using HDFS state 2021-07-23 09:25:43,502 INFO [main] util.HBaseFsck: Computing mapping of all link files . 
2021-07-23 09:25:43,691 INFO [main] util.HBaseFsck: Validating mapping using HDFS state Number of live region servers: 1 Number of dead region servers: 1 Master: node01,16000,1626086603004 Number of backup masters: 2 Average load: 50.0 Number of requests: 207161 Number of regions: 50 Number of regions in transition: 0 2021-07-23 09:25:44,100 INFO [main] util.HBaseFsck: Loading regionsinfo from the hbase:meta table Number of empty REGIONINFO_QUALIFIER rows in hbase:meta: 0 2021-07-23 09:25:44,226 INFO [main] util.HBaseFsck: getTableDescriptors == tableNames => [SYSTEM.FUNCTION, hbase_test, suntest:t2, USER, suntest:t1, SYSTEM.LOG, atlas_janus, kylin_metadata, test_schema1:t2, SYSTEM.STATS, SUNTEST.USER, hbase:namespace, test_schema1:t1, KYLIN_6OMP0DMLFQ, SYSTEM.CATALOG, ATLAS_ENTITY_AUDIT_EVENTS, SYSTEM.MUTEX, SYSTEM.SEQUENCE] 2021-07-23 09:25:44,228 INFO [main] zookeeper.ReadOnlyZKClient: Connect 0x1ddd3478 to node01:2181,node02:2181,node03:2181 with session timeout=90000ms, retries 6, retry interval 1000ms, keepAlive=60000ms 2021-07-23 09:25:44,229 INFO [ReadOnlyZKClient-node01:2181,node02:2181,node03:2181@0x1ddd3478] zookeeper.ZooKeeper: Initiating client connection, connectString=node01:2181,node02:2181,node03:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$14/781735981@6b9ac39a 2021-07-23 09:25:44,230 INFO [ReadOnlyZKClient-node01:2181,node02:2181,node03:2181@0x1ddd3478-SendThread(node02:2181)] zookeeper.ClientCnxn: Opening socket connection to server node02/10.3.2.25:2181. Will not attempt to authenticate using SASL (unknown error) 2021-07-23 09:25:44,232 INFO [ReadOnlyZKClient-node01:2181,node02:2181,node03:2181@0x1ddd3478-SendThread(node02:2181)] zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.3.2.24:43240, server: node02/10.3.2.25:2181 2021-07-23 09:25:44,273 INFO [ReadOnlyZKClient-node01:2181,node02:2181,node03:2181@0x1ddd3478-SendThread(node02:2181)] zookeeper.ClientCnxn: Session establishment complete on server node02/10.3.2.25:2181, sessionid = 0x27aa2b0324412c8, negotiated timeout = 60000 2021-07-23 09:25:44,444 INFO [main] client.ConnectionImplementation: Closing master protocol: MasterService 2021-07-23 09:25:44,445 INFO [main] zookeeper.ReadOnlyZKClient: Close zookeeper connection 0x1ddd3478 to node01:2181,node02:2181,node03:2181 Number of Tables: 18 2021-07-23 09:25:44,455 INFO [main] util.HBaseFsck: Loading region directories from HDFS 2021-07-23 09:25:44,472 INFO [ReadOnlyZKClient-node01:2181,node02:2181,node03:2181@0x1ddd3478] zookeeper.ZooKeeper: Session: 0x27aa2b0324412c8 closed 2021-07-23 09:25:44,473 INFO [ReadOnlyZKClient-node01:2181,node02:2181,node03:2181@0x1ddd3478-EventThread] zookeeper.ClientCnxn: EventThread shut down .. 2021-07-23 09:25:44,696 INFO [main] util.HBaseFsck: Loading region information from HDFS . 2021-07-23 09:25:47,212 INFO [main] util.HBaseFsck: Checking and fixing region consistency 2021-07-23 09:25:47,286 INFO [main] util.HBaseFsck: Handling overlap merges in parallel. set hbasefsck.overlap.merge.parallel to false to run serially. Summary: Table test_schema1:t1 is okay. Number of regions: 1 Deployed on: node03,16020,1625705330661 Table test_schema1:t2 is okay. Number of regions: 1 Deployed on: node03,16020,1625705330661 Table SUNTEST.USER is okay. Number of regions: 1 Deployed on: node03,16020,1625705330661 Table ATLAS_ENTITY_AUDIT_EVENTS is okay. Number of regions: 1 Deployed on: node03,16020,1625705330661 Table SYSTEM.CATALOG is okay. 
Number of regions: 1 Deployed on: node03,16020,1625705330661 Table USER is okay. Number of regions: 1 Deployed on: node03,16020,1625705330661 Table SYSTEM.SEQUENCE is okay. Number of regions: 1 Deployed on: node03,16020,1625705330661 Table SYSTEM.LOG is okay. Number of regions: 32 Deployed on: node03,16020,1625705330661 Table SYSTEM.FUNCTION is okay. Number of regions: 1 Deployed on: node03,16020,1625705330661 Table SYSTEM.MUTEX is okay. Number of regions: 1 Deployed on: node03,16020,1625705330661 Table SYSTEM.STATS is okay. Number of regions: 1 Deployed on: node03,16020,1625705330661 0 inconsistencies detected. Status: OK 2021-07-23 09:25:47,557 INFO [main] zookeeper.ZooKeeper: Session: 0x17aadd367a40a27 closed 2021-07-23 09:25:47,557 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down 2021-07-23 09:25:47,557 INFO [main] client.ConnectionImplementation: Closing master protocol: MasterService 2021-07-23 09:25:47,558 INFO [main] zookeeper.ReadOnlyZKClient: Close zookeeper connection 0x4a11eb84 to node01:2181,node02:2181,node03:2181 2021-07-23 09:25:47,597 INFO [ReadOnlyZKClient-node01:2181,node02:2181,node03:2181@0x4a11eb84] zookeeper.ZooKeeper: Session: 0x37aad46c7b71234 closed 2021-07-23 09:25:47,597 INFO [ReadOnlyZKClient-node01:2181,node02:2181,node03:2181@0x4a11eb84-EventThread] zookeeper.ClientCnxn: EventThread shut down
Note: this tool seems to have been superseded in newer versions; see 技术篇-HBase 2.0 之修复工具 HBCK2 运维指南 (an operations guide to the HBase 2.0 repair tool HBCK2).
Reference: how to properly display hex-encoded Chinese on the HBase shell command line (如何在 HBase Shell 命令行正常查看十六进制编码的中文)
hbase(main):050:0> scan 'test'
ROW          COLUMN+CELL
 row-1       column=f:c1, timestamp=1587984555307, value=\xE7\xA6\x85\xE5\x85\x8B
 row-2       column=f:c2, timestamp=1587984555307, value=HBase\xE8\x80\x81\xE5\xBA\x97
 row-3       column=f:c3, timestamp=1587984555307, value=HBase\xE5\xB7\xA5\xE4\xBD\x9C\xE7\xAC\x94\xE8\xAE\xB0
 row-4       column=f:c4, timestamp=1587984555307, value=\xE6\x88\x91\xE7\x88\xB1\xE4\xBD\xA0\xE4\xB8\xAD\xE5\x9B\xBD\xEF\xBC\x81
4 row(s) in 0.0190 seconds
hbase(main):051:0> scan 'test', {FORMATTER => 'toString'}
ROW          COLUMN+CELL
 row-1       column=f:c1, timestamp=1587984555307, value=禅克
 row-2       column=f:c2, timestamp=1587984555307, value=HBase老店
 row-3       column=f:c3, timestamp=1587984555307, value=HBase工作笔记
 row-4       column=f:c4, timestamp=1587984555307, value=我爱你中国!
4 row(s) in 0.0170 seconds
hbase(main):052:0> scan 'test', {FORMATTER => 'toString',LIMIT=>1,COLUMN=>'f:c4'}
ROW          COLUMN+CELL
 row-4       column=f:c4, timestamp=1587984555307, value=我爱你中国!
1 row(s) in 0.0180 seconds
hbase(main):053:0> scan 'test', {FORMATTER_CLASS => 'org.apache.hadoop.hbase.util.Bytes', FORMATTER => 'toString'}
ROW          COLUMN+CELL
 row-1       column=f:c1, timestamp=1587984555307, value=禅克
 row-2       column=f:c2, timestamp=1587984555307, value=HBase老店
 row-3       column=f:c3, timestamp=1587984555307, value=HBase工作笔记
 row-4       column=f:c4, timestamp=1587984555307, value=我爱你中国!
4 row(s) in 0.0220 seconds
hbase(main):054:0> scan 'test', {FORMATTER_CLASS => 'org.apache.hadoop.hbase.util.Bytes', FORMATTER => 'toString', COLUMN=>'f:c4'}
ROW          COLUMN+CELL
 row-4       column=f:c4, timestamp=1587984555307, value=我爱你中国!
1 row(s) in 0.0220 seconds
hbase(main):004:0> scan 'test', {COLUMNS => ['f:c1:toString','f:c2:toString'] }
ROW          COLUMN+CELL
 row-1       column=f:c1, timestamp=1587984555307, value=禅克
 row-2       column=f:c2, timestamp=1587984555307, value=HBase老店
2 row(s) in 0.0180 seconds
hbase(main):003:0> scan 'test', {COLUMNS => ['f:c1:c(org.apache.hadoop.hbase.util.Bytes).toString','f:c3:c(org.apache.hadoop.hbase.util.Bytes).toString'] }
ROW          COLUMN+CELL
 row-1       column=f:c1, timestamp=1587984555307, value=禅克
 row-3       column=f:c3, timestamp=1587984555307, value=HBase工作笔记
2 row(s) in 0.0160 seconds
hbase(main):055:0> scan 'test', {COLUMNS => ['f:c1:toString','f:c4:c(org.apache.hadoop.hbase.util.Bytes).toString'] }
ROW          COLUMN+CELL
 row-1       column=f:c1, timestamp=1587984555307, value=禅克
 row-4       column=f:c4, timestamp=1587984555307, value=我爱你中国!
2 row(s) in 0.0290 seconds
hbase(main):058:0> get 'test','row-2','f:c2:toString'
COLUMN       CELL
 f:c2        timestamp=1587984555307, value=Get到了吗?好意思不帮我分享嘛~哈哈~
1 row(s) in 0.0070 seconds
import org.apache.commons.codec.binary.Hex;
import org.junit.Test;

public class HbaseTest {
    /**
     * Convert the escaped UTF-8 output printed by the HBase shell back to a readable string
     */
    @Test
    public void testHbaseStr() throws Exception {
        // UTF-8 bytes as printed by the HBase shell
        String content = "\\xE7\\x83\\xA6";
        char[] chars = content.toCharArray();
        StringBuffer sb = new StringBuffer();
        for (int i = 2; i < chars.length; i = i + 4) {
            sb.append(chars[i]);
            sb.append(chars[i + 1]);
        }
        System.out.println(sb);
        String ouputStr = new String(Hex.decodeHex(sb.toString().toCharArray()), "UTF-8");
        System.out.println(ouputStr);
    }
}
You can check the size of the data stored on HDFS to get a table's size indirectly: hdfs dfs -du -h /hbase/data/default/
Puzzle: some articles online claim you can run commands such as size and get_region_info in the hbase shell to inspect this (for example: hbase查看表占用空间大小, hbase size 方法, hbase如何看表大小), but running them only produces errors, and help does not list these commands either:
hbase(main):080:0> size 'heheda'
NoMethodError: undefined method `size' for main:Object
hbase(main):084:0> get_region_info 'heheda','haha_1705282403161'
NoMethodError: undefined method `get_region_info' for main:Object
Our company's HBase cluster had passed 80% HDFS usage. Investigation found one extremely large table: it records one activity attribute per user per day, so at 400 million users over 197 days it held roughly 80 billion rows, about 4 TB -- too large for a single table. Two problems: (1) compression was not enabled; (2) no TTL was set. After discussion with the business side, we decided to keep only the most recent 93 days (3 months) of data and to enable LZO compression.
In theory every table should have compression enabled, but in the early days no constraints were placed on the business teams, so some tables with very large data volumes ended up uncompressed. We therefore considered enabling compression and a TTL online.
Considering HBase's write model, compression happens whenever an HFile is written to HDFS, which occurs in three cases: flush, split, and compact. For data that already exists, compression can only take effect during compaction. A compaction reads the region's existing HFiles and writes new ones, so in theory the new HFiles are written compressed.
Compaction comes in two forms: a minor compaction merges N files into one (N is configurable), while a major compaction merges all files of a region into one. Both read the existing files and then write new HFiles. From: hbase TTL设置 指定天 hbase的ttl; for a hands-on test see: Hbase之TTL
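Translated into the Java Admin API, enabling compression and a TTL on an existing table and then forcing the rewrite looks roughly like the sketch below (HBase 2.x client assumed; the 93-day TTL follows the scenario above, while the table and family names are placeholders):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class EnableCompressionAndTtl {
    public static void apply(Connection connection) throws Exception {
        try (Admin admin = connection.getAdmin()) {
            TableName tn = TableName.valueOf("big_table");
            ColumnFamilyDescriptor current = admin.getDescriptor(tn).getColumnFamily(Bytes.toBytes("f"));
            ColumnFamilyDescriptor updated = ColumnFamilyDescriptorBuilder.newBuilder(current)
                    .setCompressionType(Compression.Algorithm.LZO)   // or SNAPPY, depending on what is installed
                    .setTimeToLive(93 * 24 * 3600)                    // keep 93 days of data
                    .build();
            admin.modifyColumnFamily(tn, updated);
            // Existing HFiles are only rewritten (compressed, expired data dropped) during compaction
            admin.majorCompact(tn);
        }
    }
}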
There are two notable differences between how Cell TTLs and ColumnFamily TTLs are handled; see: HBase:列族TTL和单元格TTL
A TTL gives HBase table data an expiry time; once the data expires, it is removed automatically during compaction. From: hbase设置表的TTL值
FOREVER means the data never expires. Alternatively you can set a ColumnFamily TTL value (in seconds) with alter or alter_async; with the asynchronous form, progress can be checked via alter_status. The asynchronous form is usually preferred (a synchronous alter can run into NotServingRegionException while regions are reopened), and the examples below use alter_async.
-- Set, increase, or decrease the TTL
-- Column family
alter_async 'TABLE_NAME',{NAME => 'f',TTL => '<seconds>'}
-- Column
alter_async 'TABLE_NAME',{NAME => 'f:a',TTL => '<seconds>'}
-- Restore the TTL to "never expire"; here the value cannot be FOREVER or -1
alter_async 'TABLE_NAME',{NAME => 'f',TTL => '2147483647'}
Note: most articles online say you should take the table offline first (disable 'tableName') and run a major_compact at the end, but in my tests I modified the table schema directly without either step and it still worked.
Other reference articles:
HBase删除数据的原理
HBase的TTL介绍
HBase知识点总结
In-memory caching: HBase uses in-memory caches to improve read performance. They come at two levels: the block cache (Block Cache) and the row cache (Row Cache).
The hbase.regionserver.global.memstore.block.multiplier parameter can be used for tuning; by default the block cache takes 40% of heap memory. The CACHE_DATA attribute enables row caching, and the CACHE_DATA_BLOCK_ON_WRITE attribute controls whether data is cached at write time. Local cache: HBase also offers a client-side local cache to reduce network overhead. It is maintained in the client application and keeps recently accessed data in local memory; when the application reads data it checks the local cache first, and if the data is found there the result is returned directly, avoiding a network round trip. The local cache's capacity is limited, so its size should be set to balance memory consumption against performance gains.
From: HBase缓存和压缩技术 (HBase caching and compression techniques)
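On the read path these caches can also be influenced per request from the client. A small sketch (the table name is just an example): setCaching controls how many rows each RPC fetches, and setCacheBlocks decides whether the blocks touched by this scan are kept in the block cache; it is usually disabled for large one-off scans so they do not evict hot data:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

public class ScanCacheExample {
    public static void fullScan(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("scores"))) {
            Scan scan = new Scan();
            scan.setCaching(500);        // rows transferred per RPC
            scan.setCacheBlocks(false);  // do not pollute the block cache with a full-table scan
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    System.out.println(result);
                }
            }
        }
    }
}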
In HBase, data compression is implemented mainly at the storage layer. HBase supports several compression algorithms, such as Gzip, LZO, and Snappy. A compression algorithm maps data blocks to smaller ones, reducing storage space and improving I/O performance. HBase can apply compression at the storage layer and at the transport layer, with different effects.
Check whether the current HBase build supports compression: hbase org.apache.hadoop.util.NativeLibraryChecker. Reference: 第十五记·HBase压缩、HBase与Hive集成详解
You can use the CompressionTest tool to verify that the Snappy codec is available:
bin/hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://hadoop01.com:8020/test.txt snappy
On CDH-6.3.2-1 my output looked somewhat different from the reference article's, and I am not sure whether that is normal.
# Create a table with a specified compression format
create 'ods:tablename',{NAME=>'info',COMPRESSION=>'Snappy'},{NAME=>'f2'}
In most cases Snappy or LZO is a good choice, because their compression overhead is low and they save space. From: hbase数据压缩
References:
数据压缩:HBase数据压缩的技术和方法
Two well-written articles on optimization:
HBase优化之路-合理的使用编码压缩
hbase 压缩 hbase压缩方法
You can also check my other article: HBase Region分区、数据压缩及与Sqoop集成操作 (HBase region partitioning, data compression, and Sqoop integration).
hbase(main):092:0> help HBase Shell, version 2.1.0-cdh6.3.2, rUnknown, Fri Nov 8 05:44:07 PST 2019 Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command. Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group. COMMAND GROUPS: Group name: general Commands: processlist, status, table_help, version, whoami Group name: ddl Commands: alter, alter_async, alter_status, clone_table_schema, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, list_regions, locate_region, show_filters Group name: namespace Commands: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables Group name: dml Commands: append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve Group name: tools Commands: assign, balance_switch, balancer, balancer_enabled, catalogjanitor_enabled, catalogjanitor_run, catalogjanitor_switch, cleaner_chore_enabled, cleaner_chore_run, cleaner_chore_switch, clear_block_cache, clear_compaction_queues, clear_deadservers, close_region , compact, compact_rs, compaction_state, flush, is_in_maintenance_mode, list_deadservers, major_compact, merge_region, move, normalize, normalizer_enabled, normalizer_switch, split, splitormerge_enabled, splitormerge_switch, stop_master, stop_regionserver, trace, unassign, wal_roll, zk_dump Group name: replication Commands: add_peer, append_peer_exclude_namespaces, append_peer_exclude_tableCFs, append_peer_namespaces, append_peer_tableCFs, disable_peer, disable_table_replication, enable_peer, enable_table_replication, get_peer_config, list_peer_configs, list_peers, list_replica ted_tables, remove_peer, remove_peer_exclude_namespaces, remove_peer_exclude_tableCFs, remove_peer_namespaces, remove_peer_tableCFs, set_peer_bandwidth, set_peer_exclude_namespaces, set_peer_exclude_tableCFs, set_peer_namespaces, set_peer_replicate_all, set_peer_serial, set_peer_tableCFs, show_peer_tableCFs, update_peer_config Group name: snapshots Commands: clone_snapshot, delete_all_snapshot, delete_snapshot, delete_table_snapshots, list_snapshots, list_table_snapshots, restore_snapshot, snapshot Group name: configuration Commands: update_all_config, update_config Group name: quotas Commands: list_quota_snapshots, list_quota_table_sizes, list_quotas, list_snapshot_sizes, set_quota Group name: security Commands: grant, list_security_capabilities, revoke, user_permission Group name: procedures Commands: list_locks, list_procedures Group name: visibility labels Commands: add_labels, clear_auths, get_auths, list_labels, set_auths, set_visibility Group name: rsgroup Commands: add_rsgroup, balance_rsgroup, get_rsgroup, get_server_rsgroup, get_table_rsgroup, list_rsgroups, move_namespaces_rsgroup, move_servers_namespaces_rsgroup, move_servers_rsgroup, move_servers_tables_rsgroup, move_tables_rsgroup, remove_rsgroup, remove_servers_ rsgroup SHELL USAGE: Quote all names in HBase Shell such as table and column names. Commas delimit command parameters. Type <RETURN> after entering a command to run it. Dictionaries of configuration used in the creation and alteration of tables are Ruby Hashes. They look like this: {'key1' => 'value1', 'key2' => 'value2', ...} and are opened and closed with curley-braces. Key/values are delimited by the '=>' character combination. 
Usually keys are predefined constants such as NAME, VERSIONS, COMPRESSION, etc. Constants do not need to be quoted. Type 'Object.constants' to see a (messy) list of all constants in the environment. If you are using binary keys or values and need to enter them in the shell, use double-quote'd hexadecimal representation. For example: hbase> get 't1', "key\x03\x3f\xcd" hbase> get 't1', "key\003\023\011" hbase> put 't1', "test\xef\xff", 'f1:', "\x01\x33\x40" The HBase shell is the (J)Ruby IRB with the above HBase-specific commands added. For more on the HBase Shell, see http://hbase.apache.org/book.html
Viewing help for a single command:
hbase(main):093:0> help "get_peer_config"
Outputs the cluster key, replication endpoint class (if present), and any replication configuration parameters
hbase(main):094:0> help "list_peer_configs"
No-argument method that outputs the replication peer configuration for each peer defined on this cluster.
hbase(main):098:0> help "list"
List all user tables in hbase. Optional regular expression parameter could
be used to filter the output. Examples:
hbase> list
hbase> list 'abc.*'
hbase> list 'ns:abc.*'
hbase> list 'ns:.*'
References:
hbase操作(shell 命令,如建表,清空表,增删改查)以及 hbase表存储结构和原理
【甘道夫】HBase基本数据操作详解【完整版,绝对精品】
HBase 常用Shell命令
HBase shell详情