当前位置:   article > 正文

Roaring Bitmap 原理及实践_roaringbitmap

roaringbitmap

Roaring Bitmap 介绍

存储整数集(类似于存储整数的 Set),并支持整数集的集合运算(交并差等)。之所以在一些场景下使用 Roaring Bitmap,而不是 Set,是因为其节省存储空间以及集合运算效率更优。

使用示例

<dependencies>
    <dependency>
        <groupId>org.roaringbitmap</groupId>
        <artifactId>RoaringBitmap</artifactId>
        <version>0.9.0</version>
    </dependency>
</dependencies>
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
package roaringbitmap;

import org.roaringbitmap.RoaringBitmap;
import org.roaringbitmap.longlong.Roaring64Bitmap;

import java.util.Iterator;

public class Test {
    public static void main(String[] args) {
        RoaringBitmap r1 = RoaringBitmap.bitmapOf(15, 0, Integer.MIN_VALUE, Integer.MAX_VALUE, -1, -8);
        print(r1, "r1");

        RoaringBitmap r2 = new RoaringBitmap();
        r2.add(-1);
        r2.add(0);
        r2.add(1);
        print(r2, "r2");

        RoaringBitmap intersect = RoaringBitmap.and(r1, r2);
        print(intersect, "r1-and-r2");

        System.out.println("=================");

        Roaring64Bitmap r64 = Roaring64Bitmap.bitmapOf(1, 2, Long.MAX_VALUE, Long.MIN_VALUE, -1, -3);
        System.out.println("RBM[r64] cardinality: " + r64.getLongCardinality());
        Iterator<Long> iterator = r64.iterator();
        while (iterator.hasNext()) {
            System.out.print(iterator.next() + "\t");
        }
        System.out.println();

    }

    public static void print(RoaringBitmap r, String name) {
        System.out.println("RBM[" + name + "] cardinality: " + r.getLongCardinality());
        for(int i: r) {
            System.out.print(i + "\t");
        }
        System.out.println();
    }


}
/*
RBM[r1] cardinality: 6 
0        15        2147483647        -2147483648        -8        -1         
RBM[r2] cardinality: 3 
0        1        -1         
RBM[r1-and-r2] cardinality: 2 
0        -1         
================= 
RBM[r64] cardinality: 6 
1        2        9223372036854775807        -9223372036854775808        -3        -1    
*/
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54

Bitmap/Bitset

问:1 亿个设备 id(imeimd5, 长度 32 字符串, 512 bit)需要多少存储空间?
答:1 亿 * 64 Byte = 64 亿 Byte = 6.4 G

但是,如果一个设备 id 使用 1 bit 存储,需要存储空间是 6.4G / 512 = 12.5 M

set = {1, 2, 15}
在这里插入图片描述

package roaringbitmap;

public class Test {
    public static void main(String[] args) {
        byte[] bits = new byte[2];

        add(bits, 1);
        add(bits, 2);
        System.out.println(Integer.toBinaryString(bits[0]));    // 110
    }

    public static void add(byte[] bits, int num){
        byte pre = bits[num / 8];
        byte newVal = (byte) (pre | (1 << (num % 8)));
        bits[num / 8] = newVal;
    }
}
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17

缺点:稀疏数据存储依旧浪费空间,举一个极端例子,用位图只存储数字 0 和 1599(8*200 - 1),至少需要开辟长度为 200 的 byte 数组,而普通 set 存储 0 和 1599 两个 Integer 只需要 8 字节!

Roaring Bitmap 理论

We partition the range of 32-bit indexes ( [0,n) ) into chunks of 2^16 integers sharing the same 16 most significant digits. We use specialized containers to store their 16 least significant bits.
When a chunk contains no more than 4096 integers, we use a sorted array of packed 16-bit integers. When there are more than 4096 integers, we use a 2^16 -bit bitmap. Thus, we have two types of containers: an array container for sparse chunks and a bitmap container for dense chunks.

在这里插入图片描述

  • 根据实际数据状况,桶动态生成(而不是一开始就建立所有桶,避免不必要的空间浪费)
  • 根据实际数据状况,若桶中数据少(稀疏数据),使用 array 存;若桶中数据多,使用 bitmap 存
  • 为了设计及实现简单,bitmap 大小固定为 2^16 位(8KB)。4096 个 Integer(低 16 位)整好占 8KB 存储,为了节省空间,桶中数据少于 4096 个数时,使用 array,否则使用 bitmap。

Our new addition, the run container, is made of a packed array of pairs of 16-bit integers. The first value of each pair represents a starting value, whereas the second value is the length of a run. For example, we would store the values 11,12,13,14,15 as the pair 11,4 where 4 means that beyond 11 itself,there are 4 contiguous values that follow.

Roaring Bitmap 源码

/**
 * RoaringBitmap, a compressed alternative to the BitSet.
 *
 * <pre>
 * {@code
 *      import org.roaringbitmap.*;
 *
 *      //...
 *
 *      RoaringBitmap rr = RoaringBitmap.bitmapOf(1,2,3,1000);
 *      RoaringBitmap rr2 = new RoaringBitmap();
 *      for(int k = 4000; k<4255;++k) rr2.add(k);
 *      RoaringBitmap rror = RoaringBitmap.or(rr, rr2);
 *
 *      //...
 *      DataOutputStream wheretoserialize = ...
 *      rr.runOptimize(); // can help compression
 *      rr.serialize(wheretoserialize);
 * }
 * </pre>
 *
 * Integers are added in unsigned sorted order. That is, they are treated as unsigned integers (see
 * Java 8's Integer.toUnsignedLong function).
 * Up to 4294967296 integers
 * can be stored.
 *
 *
 *
 */
 
 
 /**
 * Roaring64Bitmap is a compressed 64 bit bitmap. It can contain all the numbers of long
 * rang[Long.MAX_VALUE, Long.MIN_VALUE]. Since java has no unsigned long,we treat the negative value
 * as a successor of the positive value. So the ascending ordering of all the long value is:
 * 0,1,2...Long.MAX_VALUE,Long.MIN_VALUE,Long.MIN_VALUE+1.......-1. See Long.toUnsignedString()
 */
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37

关键类图

在这里插入图片描述

部分源码

/**
    RoaringBitmap 初始化
*/

RoaringBitmap r2 = new RoaringBitmap();

// step 1
public RoaringBitmap() {
  highLowContainer = new RoaringArray();
}
// step 2
protected RoaringArray() {
  this(INITIAL_CAPACITY);    // INITIAL_CAPACITY = 4
}
// step 3
RoaringArray(int initialCapacity) {
  this(new char[initialCapacity], new Container[initialCapacity], 0);
}
// step 4
RoaringArray(char[] keys, Container[] values, int size) {
  this.keys = keys;
  this.values = values;
  this.size = size;
}
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
/**
    org.roaringbitmap.RoaringBitmap#add(int)
*/

public void add(final int x) {
  final char hb = Util.highbits(x);
  final int i = highLowContainer.getIndex(hb);
  if (i >= 0) {
    highLowContainer.setContainerAtIndex(i,
        highLowContainer.getContainerAtIndex(i).add(Util.lowbits(x)));
  } else {
    final ArrayContainer newac = new ArrayContainer();
    highLowContainer.insertNewKeyValueAt(-i - 1, hb, newac.add(Util.lowbits(x)));
  }
}

// (tips)获取 Integer 高 16 位
protected static char highbits(int x) {
  return (char) (x >>> 16);
}
// (tips)获取 Integer 低 16 位
protected static char lowbits(int x) {
  return (char) x;
}

// (tips)查找等于 k 的位置,或第一个大于 k 的位置
// starts with binary search and finishes with a sequential search
protected static int hybridUnsignedBinarySearch(final char[] array, final int begin,
    final int end, final char k) {
  // next line accelerates the possibly common case where the value would
  // be inserted at the end
  if ((end > 0) && ((array[end - 1]) < (int) k)) {
    return -end - 1;
  }
  int low = begin;
  int high = end - 1;
  // 32 in the next line matches the size of a cache line
  while (low + 32 <= high) {
    final int middleIndex = (low + high) >>> 1;
    final int middleValue = (array[middleIndex]);

    if (middleValue < (int) k) {
      low = middleIndex + 1;
    } else if (middleValue > (int) k) {
      high = middleIndex - 1;
    } else {
      return middleIndex;
    }
  }
  // we finish the job with a sequential search
  int x = low;
  for (; x <= high; ++x) {
    final int val = (array[x]);
    if (val >= (int) k) {
      if (val == (int) k) {
        return x;
      }
      break;
    }
  }
  return -(x + 1);
}

// 查看缓存行大小
// cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54
  • 55
  • 56
  • 57
  • 58
  • 59
  • 60
  • 61
  • 62
  • 63
  • 64
  • 65
/**
    org.roaringbitmap.BitmapContainer
*/

@Override
public Container add(final char i) {
  final long previous = bitmap[i >>> 6];
  long newval = previous | (1L << i); // 1 << (i % 64)
  bitmap[i >>> 6] = newval;
  if (USE_BRANCHLESS) {    // optimization flag: whether the cardinality of the bitmaps
                           // is maintained through branchless operations
    cardinality += (int)((previous ^ newval) >>> i);    // 无需判断 previous 是否等于 newval
                                                   // 避免 CPU 分支预测错误带来的性能开销
  } else if (previous != newval) {
    ++cardinality;
  }
  return this;
}


@Override
public boolean contains(final char i) {
  return (bitmap[i >>> 6] & (1L << i)) != 0;
}


@Override
public int getCardinality() {
  return cardinality;
}


@Override
public Container remove(final char i) {
  int index = i >>> 6;
  long bef = bitmap[index];
  long mask = 1L << i;
  if (cardinality == ArrayContainer.DEFAULT_MAX_SIZE + 1) {    // DEFAULT_MAX_SIZE = 4096
    // this is
    // the
    // uncommon
    // path
    if ((bef & mask) != 0) {
      --cardinality;
      bitmap[i >>> 6] = bef & ~mask;    // 存储数 i 对应位置成 0
      return this.toArrayContainer();
    }
  }
  long aft = bef & ~mask;    // 存储数 i 的位由 1 置成 0
  cardinality -= (aft - bef) >>> 63;    // 存储数 i 的位若由 1 置成 0,aft - bef < 0,其最高位必为 1
  bitmap[index] = aft;
  return this;
}


/**
 * Fill the array with set bits
 *
 * @param array container (should be sufficiently large)
 */
void fillArray(final char[] array) {
  int pos = 0;
  int base = 0;
  for (int k = 0; k < bitmap.length; ++k) {
    long bitset = bitmap[k];
    while (bitset != 0) {
      // Long.numberOfTrailingZeros
      // Returns the number of zero bits following the lowest-order ("rightmost") one-bit 
      // in the two's complement binary representation of the specified long value. 
      // Returns 64 if the specified value has no one-bits in its two's complement representation,
      // in other words if it is equal to zero.
      array[pos++] = (char) (base + numberOfTrailingZeros(bitset));
      bitset &= (bitset - 1);    // 把最低 1 位置成 0
    }
    base += 64;
  }
}


@Override
public Container and(final BitmapContainer value2) {
  int newCardinality = 0;
  for (int k = 0; k < this.bitmap.length; ++k) {
    // [2] CPU 实现了快速计算一个处理字中位 1 个数的指令,Long.bitCount 底层基于此指令实现
    newCardinality += Long.bitCount(this.bitmap[k] & value2.bitmap[k]);
  }
  if (newCardinality > ArrayContainer.DEFAULT_MAX_SIZE) {
    final BitmapContainer answer = new BitmapContainer();
    for (int k = 0; k < answer.bitmap.length; ++k) {
      answer.bitmap[k] = this.bitmap[k] & value2.bitmap[k];
    }
    answer.cardinality = newCardinality;
    return answer;
  }
  ArrayContainer ac = new ArrayContainer(newCardinality);
  Util.fillArrayAND(ac.content, this.bitmap, value2.bitmap);
  ac.cardinality = newCardinality;
  return ac;
}


@Override
public ArrayContainer and(final ArrayContainer value2) {
  final ArrayContainer answer = new ArrayContainer(value2.content.length);
  int c = value2.cardinality;
  for (int k = 0; k < c; ++k) {
    char v = value2.content[k];
    answer.content[answer.cardinality] = v;
    answer.cardinality += (int)this.bitValue(v);
  }
  return answer;
}
int bitValue(final char i) {
  return (int)(bitmap[i >>> 6] >>> i ) & 1;
}

/**
    空间压缩(优化),比较当前容器占据空间和使用 RunContainer 占据空间的大小关系
*/
@Override
public Container runOptimize() {
  // 计算一下,如果用 RunContainer 存储,需要多少轮(每轮需要 2 个 char,4 Bytes)
  int numRuns = numberOfRunsLowerBound(MAXRUNS); // decent choice

  int sizeAsRunContainerLowerBound = RunContainer.serializedSizeInBytes(numRuns);

  if (sizeAsRunContainerLowerBound >= getArraySizeInBytes()) {
    return this;
  }
  // else numRuns is a relatively tight bound that needs to be exact
  // in some cases (or if we need to make the runContainer the right
  // size)
  numRuns += numberOfRunsAdjustment();
  int sizeAsRunContainer = RunContainer.serializedSizeInBytes(numRuns);

  if (getArraySizeInBytes() > sizeAsRunContainer) {
    return new RunContainer(this, numRuns);
  } else {
    return this;
  }
}
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54
  • 55
  • 56
  • 57
  • 58
  • 59
  • 60
  • 61
  • 62
  • 63
  • 64
  • 65
  • 66
  • 67
  • 68
  • 69
  • 70
  • 71
  • 72
  • 73
  • 74
  • 75
  • 76
  • 77
  • 78
  • 79
  • 80
  • 81
  • 82
  • 83
  • 84
  • 85
  • 86
  • 87
  • 88
  • 89
  • 90
  • 91
  • 92
  • 93
  • 94
  • 95
  • 96
  • 97
  • 98
  • 99
  • 100
  • 101
  • 102
  • 103
  • 104
  • 105
  • 106
  • 107
  • 108
  • 109
  • 110
  • 111
  • 112
  • 113
  • 114
  • 115
  • 116
  • 117
  • 118
  • 119
  • 120
  • 121
  • 122
  • 123
  • 124
  • 125
  • 126
  • 127
  • 128
  • 129
  • 130
  • 131
  • 132
  • 133
  • 134
  • 135
  • 136
  • 137
  • 138
  • 139
  • 140
  • 141
/**
    org.roaringbitmap.ArrayContainer
*/

// ArrayContainer 动态扩容设计,DEFAULT_INIT_SIZE = 4
private void increaseCapacity(boolean allowIllegalSize) {
  int newCapacity = (this.content.length == 0) ? DEFAULT_INIT_SIZE
      : this.content.length < 64 ? this.content.length * 2
          : this.content.length < 1067 ? this.content.length * 3 / 2
              : this.content.length * 5 / 4;
  // never allocate more than we will ever need
  if (newCapacity > ArrayContainer.DEFAULT_MAX_SIZE && !allowIllegalSize) {
    newCapacity = ArrayContainer.DEFAULT_MAX_SIZE;
  }
  // if we are within 1/16th of the max, go to max
  if (newCapacity > ArrayContainer.DEFAULT_MAX_SIZE - ArrayContainer.DEFAULT_MAX_SIZE / 16
      && !allowIllegalSize) {
    newCapacity = ArrayContainer.DEFAULT_MAX_SIZE;
  }
  this.content = Arrays.copyOf(this.content, newCapacity);
}
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 大量位运算技巧,加速计算
  • 充分利用 CPU 硬件特性(高速缓存/避免分支预测/特殊硬件指令bitCount),加速计算
  • ArrayContainer 合理扩容策略,充分节省存储

Roaring Bitmap 实践

在大数据标签平台中,用于标签圈选计算符合圈选条件的人群,一来加速计算,二来压缩存储,效果甚好

参考资料

[1] Roaring Bitmap 官网
[2]《Better bitmap performance with Roaring bitmaps》
[3]《Consistently faster and smaller compressed bitmaps with Roaring》
[4] RoaringBitmap 工程 Github 源码

声明:本文内容由网友自发贡献,不代表【wpsshop博客】立场,版权归原作者所有,本站不承担相应法律责任。如您发现有侵权的内容,请联系我们。转载请注明出处:https://www.wpsshop.cn/w/不正经/article/detail/478747
推荐阅读
相关标签
  

闽ICP备14008679号