赞
踩
index、row group index、bloom filter index
轻量级索引Row Group Index
一个orc文件包含一个或者多个stripe(groups of row data),stripe里面存放数据和索引和stripe footer。每个stripe包含了每个列的最大值和最小值,当查询><=的时候,可以根据max和min来跳过不必要的stripe。
其中为每个stripe建立的包含min/max值的索引,就称为Row Group Index,也叫min-max Index,或者Storage Index。在建立ORC格式表时,指定表参数’orc.create.index’=’true’之后,便会建立Row Group Index,需要注意的是,为了使Row Group Index有效利用,向表中加载数据时,必须对需要使用索引的字段进行排序,否则,min/max会失去意义。另外,这种索引通常用于数值型字段的查询过滤优化上。
ORC查询优化
一个ORC文件会被分成多个stripe,而且文件的元数据中有每个字段的统计信息(min/max,hasNull等等),这就为ORC的查询优化做好了基础准备。假如我的查询过滤条件为WHERE id = 0;在Map Task读到一个ORC文件时,首先从文件的统计信息中看看id字段的min/max值,如果0不包含在内,那么这个文件就可以直接跳过了。基于这点,还有一个更有效的优化手段是在数据入库的时候,根据id字段排序后入库,这样尽量能使id=0的数据位于同一个文件甚至是同一个stripe中,那么在查询时候,只有负责读取该文件的Map Task需要扫描文件,其他的Map Task都会跳过扫描,大大节省Map Task的执行时间。
使用下面的HQL构造一个较大的ORC表:
CREATE TABLE test_orc3 stored AS ORC
AS
SELECT CAST(siteid AS INT) AS id,
pcid
FROM lxw1234_text
DISTRIBUTE BY id sort BY id;
该语句保证相同的id位于同一个ORC文件中,并且是排序的。
摘自:http://lxw1234.com/archives/2016/04/632.htm
distribute by和sort by的字段相同 = cluster by
order by 只会在一个reduce中,distribute by和sort by 来代替他,distribute by 会根据字段进行hash,分多个reduce
sort by 排序,在每一个reduce中进行排序
详情:https://blog.csdn.net/bitcarmanlee/article/details/51694616
- CREATE TABLE test_orc(
- id INT,
- name STRING
- ) stored AS ORC;
- 指定orc、分区分桶都需要用临时表insert进去,否则会出错,桶没有分等
-
- CREATE TABLE test_orc1 stored AS ORC
- TBLPROPERTIES
- ('orc.compress'='SNAPPY',
- 'orc.create.index'='true',
- 'orc.bloom.filter.fpp'='0.05',
- 'orc.stripe.size'='10485760',
- 'orc.row.index.stride'='10000')
- {
- "fileName": "\/user\/hive\/warehouse\/test.db\/test_orc\/000000_1",
- "fileVersion": "0.12",
- "writerVersion": "HIVE_13083",
- "numberOfRows": 90,
- "compression": "ZLIB",
- "compressionBufferSize": 262144,
- "schemaString": "struct<id:int,name:string>",
- "schema": [
- {
- "columnId": 0,
- "columnType": "STRUCT",
- "childColumnNames": [
- "id",
- "name"
- ],
- "childColumnIds": [
- 1,
- 2
- ]
- },
- {
- "columnId": 1,
- "columnType": "INT"
- },
- {
- "columnId": 2,
- "columnType": "STRING"
- }
- ],
- "stripeStatistics": [{
- "stripeNumber": 1,
- "columnStatistics": [
- {
- "columnId": 0,
- "count": 90,
- "hasNull": false
- },
- {
- "columnId": 1,
- "count": 90,
- "hasNull": false,
- "min": 1,
- "max": 7,
- "sum": 345,
- "type": "LONG"
- },
- {
- "columnId": 2,
- "count": 90,
- "hasNull": false,
- "min": "吕布",
- "max": "马超",
- "totalLength": 540,
- "type": "STRING"
- }
- ]
- }],
- "fileStatistics": [
- {
- "columnId": 0,
- "count": 90,
- "hasNull": false
- },
- {
- "columnId": 1,
- "count": 90,
- "hasNull": false,
- "min": 1,
- "max": 7,
- "sum": 345,
- "type": "LONG"
- },
- {
- "columnId": 2,
- "count": 90,
- "hasNull": false,
- "min": "吕布",
- "max": "马超",
- "totalLength": 540,
- "type": "STRING"
- }
- ],
- "stripes": [{
- "stripeNumber": 1,
- "stripeInformation": {
- "offset": 3,
- "indexLength": 73,
- "dataLength": 68,
- "footerLength": 53,
- "rowCount": 90
- },
- "streams": [
- {
- "columnId": 0,
- "section": "ROW_INDEX",
- "startOffset": 3,
- "length": 11
- },
- {
- "columnId": 1,
- "section": "ROW_INDEX",
- "startOffset": 14,
- "length": 25
- },
- {
- "columnId": 2,
- "section": "ROW_INDEX",
- "startOffset": 39,
- "length": 37
- },
- {
- "columnId": 1,
- "section": "DATA",
- "startOffset": 76,
- "length": 12
- },
- {
- "columnId": 2,
- "section": "DATA",
- "startOffset": 88,
- "length": 12
- },
- {
- "columnId": 2,
- "section": "LENGTH",
- "startOffset": 100,
- "length": 5
- },
- {
- "columnId": 2,
- "section": "DICTIONARY_DATA",
- "startOffset": 105,
- "length": 39
- }
- ],
- "encodings": [
- {
- "columnId": 0,
- "kind": "DIRECT"
- },
- {
- "columnId": 1,
- "kind": "DIRECT_V2"
- },
- {
- "columnId": 2,
- "kind": "DICTIONARY_V2",
- "dictionarySize": 6
- }
- ]
- }],
- "fileLength": 373,
- "paddingLength": 0,
- "paddingRatio": 0,
- "status": "OK"
- }
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。