赞
踩
存储整数集(类似于存储整数的 Set),并支持整数集的集合运算(交并差等)。之所以在一些场景下使用 Roaring Bitmap,而不是 Set,是因为其节省存储空间以及集合运算效率更优。
<dependencies>
<dependency>
<groupId>org.roaringbitmap</groupId>
<artifactId>RoaringBitmap</artifactId>
<version>0.9.0</version>
</dependency>
</dependencies>
package roaringbitmap; import org.roaringbitmap.RoaringBitmap; import org.roaringbitmap.longlong.Roaring64Bitmap; import java.util.Iterator; public class Test { public static void main(String[] args) { RoaringBitmap r1 = RoaringBitmap.bitmapOf(15, 0, Integer.MIN_VALUE, Integer.MAX_VALUE, -1, -8); print(r1, "r1"); RoaringBitmap r2 = new RoaringBitmap(); r2.add(-1); r2.add(0); r2.add(1); print(r2, "r2"); RoaringBitmap intersect = RoaringBitmap.and(r1, r2); print(intersect, "r1-and-r2"); System.out.println("================="); Roaring64Bitmap r64 = Roaring64Bitmap.bitmapOf(1, 2, Long.MAX_VALUE, Long.MIN_VALUE, -1, -3); System.out.println("RBM[r64] cardinality: " + r64.getLongCardinality()); Iterator<Long> iterator = r64.iterator(); while (iterator.hasNext()) { System.out.print(iterator.next() + "\t"); } System.out.println(); } public static void print(RoaringBitmap r, String name) { System.out.println("RBM[" + name + "] cardinality: " + r.getLongCardinality()); for(int i: r) { System.out.print(i + "\t"); } System.out.println(); } } /* RBM[r1] cardinality: 6 0 15 2147483647 -2147483648 -8 -1 RBM[r2] cardinality: 3 0 1 -1 RBM[r1-and-r2] cardinality: 2 0 -1 ================= RBM[r64] cardinality: 6 1 2 9223372036854775807 -9223372036854775808 -3 -1 */
问:1 亿个设备 id(imeimd5, 长度 32 字符串, 512 bit)需要多少存储空间?
答:1 亿 * 64 Byte = 64 亿 Byte = 6.4 G
但是,如果一个设备 id 使用 1 bit 存储,需要存储空间是 6.4G / 512 = 12.5 M
set = {1, 2, 15}
package roaringbitmap; public class Test { public static void main(String[] args) { byte[] bits = new byte[2]; add(bits, 1); add(bits, 2); System.out.println(Integer.toBinaryString(bits[0])); // 110 } public static void add(byte[] bits, int num){ byte pre = bits[num / 8]; byte newVal = (byte) (pre | (1 << (num % 8))); bits[num / 8] = newVal; } }
缺点:稀疏数据存储依旧浪费空间,举一个极端例子,用位图只存储数字 0 和 1599(8*200 - 1),至少需要开辟长度为 200 的 byte 数组,而普通 set 存储 0 和 1599 两个 Integer 只需要 8 字节!
We partition the range of 32-bit indexes ( [0,n) ) into chunks of 2^16 integers sharing the same 16 most significant digits. We use specialized containers to store their 16 least significant bits.
When a chunk contains no more than 4096 integers, we use a sorted array of packed 16-bit integers. When there are more than 4096 integers, we use a 2^16 -bit bitmap. Thus, we have two types of containers: an array container for sparse chunks and a bitmap container for dense chunks.
Our new addition, the run container, is made of a packed array of pairs of 16-bit integers. The first value of each pair represents a starting value, whereas the second value is the length of a run. For example, we would store the values 11,12,13,14,15 as the pair 11,4 where 4 means that beyond 11 itself,there are 4 contiguous values that follow.
/** * RoaringBitmap, a compressed alternative to the BitSet. * * <pre> * {@code * import org.roaringbitmap.*; * * //... * * RoaringBitmap rr = RoaringBitmap.bitmapOf(1,2,3,1000); * RoaringBitmap rr2 = new RoaringBitmap(); * for(int k = 4000; k<4255;++k) rr2.add(k); * RoaringBitmap rror = RoaringBitmap.or(rr, rr2); * * //... * DataOutputStream wheretoserialize = ... * rr.runOptimize(); // can help compression * rr.serialize(wheretoserialize); * } * </pre> * * Integers are added in unsigned sorted order. That is, they are treated as unsigned integers (see * Java 8's Integer.toUnsignedLong function). * Up to 4294967296 integers * can be stored. * * * */ /** * Roaring64Bitmap is a compressed 64 bit bitmap. It can contain all the numbers of long * rang[Long.MAX_VALUE, Long.MIN_VALUE]. Since java has no unsigned long,we treat the negative value * as a successor of the positive value. So the ascending ordering of all the long value is: * 0,1,2...Long.MAX_VALUE,Long.MIN_VALUE,Long.MIN_VALUE+1.......-1. See Long.toUnsignedString() */
/** RoaringBitmap 初始化 */ RoaringBitmap r2 = new RoaringBitmap(); // step 1 public RoaringBitmap() { highLowContainer = new RoaringArray(); } // step 2 protected RoaringArray() { this(INITIAL_CAPACITY); // INITIAL_CAPACITY = 4 } // step 3 RoaringArray(int initialCapacity) { this(new char[initialCapacity], new Container[initialCapacity], 0); } // step 4 RoaringArray(char[] keys, Container[] values, int size) { this.keys = keys; this.values = values; this.size = size; }
/** org.roaringbitmap.RoaringBitmap#add(int) */ public void add(final int x) { final char hb = Util.highbits(x); final int i = highLowContainer.getIndex(hb); if (i >= 0) { highLowContainer.setContainerAtIndex(i, highLowContainer.getContainerAtIndex(i).add(Util.lowbits(x))); } else { final ArrayContainer newac = new ArrayContainer(); highLowContainer.insertNewKeyValueAt(-i - 1, hb, newac.add(Util.lowbits(x))); } } // (tips)获取 Integer 高 16 位 protected static char highbits(int x) { return (char) (x >>> 16); } // (tips)获取 Integer 低 16 位 protected static char lowbits(int x) { return (char) x; } // (tips)查找等于 k 的位置,或第一个大于 k 的位置 // starts with binary search and finishes with a sequential search protected static int hybridUnsignedBinarySearch(final char[] array, final int begin, final int end, final char k) { // next line accelerates the possibly common case where the value would // be inserted at the end if ((end > 0) && ((array[end - 1]) < (int) k)) { return -end - 1; } int low = begin; int high = end - 1; // 32 in the next line matches the size of a cache line while (low + 32 <= high) { final int middleIndex = (low + high) >>> 1; final int middleValue = (array[middleIndex]); if (middleValue < (int) k) { low = middleIndex + 1; } else if (middleValue > (int) k) { high = middleIndex - 1; } else { return middleIndex; } } // we finish the job with a sequential search int x = low; for (; x <= high; ++x) { final int val = (array[x]); if (val >= (int) k) { if (val == (int) k) { return x; } break; } } return -(x + 1); } // 查看缓存行大小 // cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
/** org.roaringbitmap.BitmapContainer */ @Override public Container add(final char i) { final long previous = bitmap[i >>> 6]; long newval = previous | (1L << i); // 1 << (i % 64) bitmap[i >>> 6] = newval; if (USE_BRANCHLESS) { // optimization flag: whether the cardinality of the bitmaps // is maintained through branchless operations cardinality += (int)((previous ^ newval) >>> i); // 无需判断 previous 是否等于 newval // 避免 CPU 分支预测错误带来的性能开销 } else if (previous != newval) { ++cardinality; } return this; } @Override public boolean contains(final char i) { return (bitmap[i >>> 6] & (1L << i)) != 0; } @Override public int getCardinality() { return cardinality; } @Override public Container remove(final char i) { int index = i >>> 6; long bef = bitmap[index]; long mask = 1L << i; if (cardinality == ArrayContainer.DEFAULT_MAX_SIZE + 1) { // DEFAULT_MAX_SIZE = 4096 // this is // the // uncommon // path if ((bef & mask) != 0) { --cardinality; bitmap[i >>> 6] = bef & ~mask; // 存储数 i 对应位置成 0 return this.toArrayContainer(); } } long aft = bef & ~mask; // 存储数 i 的位由 1 置成 0 cardinality -= (aft - bef) >>> 63; // 存储数 i 的位若由 1 置成 0,aft - bef < 0,其最高位必为 1 bitmap[index] = aft; return this; } /** * Fill the array with set bits * * @param array container (should be sufficiently large) */ void fillArray(final char[] array) { int pos = 0; int base = 0; for (int k = 0; k < bitmap.length; ++k) { long bitset = bitmap[k]; while (bitset != 0) { // Long.numberOfTrailingZeros // Returns the number of zero bits following the lowest-order ("rightmost") one-bit // in the two's complement binary representation of the specified long value. // Returns 64 if the specified value has no one-bits in its two's complement representation, // in other words if it is equal to zero. array[pos++] = (char) (base + numberOfTrailingZeros(bitset)); bitset &= (bitset - 1); // 把最低 1 位置成 0 } base += 64; } } @Override public Container and(final BitmapContainer value2) { int newCardinality = 0; for (int k = 0; k < this.bitmap.length; ++k) { // [2] CPU 实现了快速计算一个处理字中位 1 个数的指令,Long.bitCount 底层基于此指令实现 newCardinality += Long.bitCount(this.bitmap[k] & value2.bitmap[k]); } if (newCardinality > ArrayContainer.DEFAULT_MAX_SIZE) { final BitmapContainer answer = new BitmapContainer(); for (int k = 0; k < answer.bitmap.length; ++k) { answer.bitmap[k] = this.bitmap[k] & value2.bitmap[k]; } answer.cardinality = newCardinality; return answer; } ArrayContainer ac = new ArrayContainer(newCardinality); Util.fillArrayAND(ac.content, this.bitmap, value2.bitmap); ac.cardinality = newCardinality; return ac; } @Override public ArrayContainer and(final ArrayContainer value2) { final ArrayContainer answer = new ArrayContainer(value2.content.length); int c = value2.cardinality; for (int k = 0; k < c; ++k) { char v = value2.content[k]; answer.content[answer.cardinality] = v; answer.cardinality += (int)this.bitValue(v); } return answer; } int bitValue(final char i) { return (int)(bitmap[i >>> 6] >>> i ) & 1; } /** 空间压缩(优化),比较当前容器占据空间和使用 RunContainer 占据空间的大小关系 */ @Override public Container runOptimize() { // 计算一下,如果用 RunContainer 存储,需要多少轮(每轮需要 2 个 char,4 Bytes) int numRuns = numberOfRunsLowerBound(MAXRUNS); // decent choice int sizeAsRunContainerLowerBound = RunContainer.serializedSizeInBytes(numRuns); if (sizeAsRunContainerLowerBound >= getArraySizeInBytes()) { return this; } // else numRuns is a relatively tight bound that needs to be exact // in some cases (or if we need to make the runContainer the right // size) numRuns += numberOfRunsAdjustment(); int sizeAsRunContainer = RunContainer.serializedSizeInBytes(numRuns); if (getArraySizeInBytes() > sizeAsRunContainer) { return new RunContainer(this, numRuns); } else { return this; } }
/** org.roaringbitmap.ArrayContainer */ // ArrayContainer 动态扩容设计,DEFAULT_INIT_SIZE = 4 private void increaseCapacity(boolean allowIllegalSize) { int newCapacity = (this.content.length == 0) ? DEFAULT_INIT_SIZE : this.content.length < 64 ? this.content.length * 2 : this.content.length < 1067 ? this.content.length * 3 / 2 : this.content.length * 5 / 4; // never allocate more than we will ever need if (newCapacity > ArrayContainer.DEFAULT_MAX_SIZE && !allowIllegalSize) { newCapacity = ArrayContainer.DEFAULT_MAX_SIZE; } // if we are within 1/16th of the max, go to max if (newCapacity > ArrayContainer.DEFAULT_MAX_SIZE - ArrayContainer.DEFAULT_MAX_SIZE / 16 && !allowIllegalSize) { newCapacity = ArrayContainer.DEFAULT_MAX_SIZE; } this.content = Arrays.copyOf(this.content, newCapacity); }
在大数据标签平台中,用于标签圈选计算符合圈选条件的人群,一来加速计算,二来压缩存储,效果甚好
[1] Roaring Bitmap 官网
[2]《Better bitmap performance with Roaring bitmaps》
[3]《Consistently faster and smaller compressed bitmaps with Roaring》
[4] RoaringBitmap 工程 Github 源码
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。