赞
踩
由布隆(Burton Howard Bloom)在1970年提出的。它实际上是由一个很长的二进制向量和一系列随机映射函数组成,布隆过滤器可以用于检索一个元素是否在一个集合中。它的优点是空间效率和查询时间都远远超过一般的算法,缺点是有一定的误识别率(假正例False positives,即Bloom Filter报告某一元素存在于某集合中,但是实际上该元素并不在集合中)和删除困难,但是没有识别错误的情形(即假反例False negatives,如果某个元素确实没有在该集合中,那么Bloom Filter 是不会报告该元素存在于集合中的,所以不会漏报)。
如果想判断一个元素是不是在一个集合里,一般想到的是将所有元素保存起来,然后通过比较确定。链表,树等等数据结构都是这种思路. 但是随着集合中元素的增加,我们需要的存储空间越来越大,检索速度也越来越慢。不过世界上还有一种叫作散列表(又叫哈希表,Hash table)的数据结构。它可以通过一个Hash函数将一个元素映射成一个位阵列(Bit Array)中的一个点。这样一来,我们只要看看这个点是不是 1 就知道可以集合中有没有它了。这就是布隆过滤器的基本思想。
Hash面临的问题就是冲突。假设 Hash 函数是良好的,如果我们的位阵列长度为 m 个点,那么如果我们想将冲突率降低到例如 1%, 这个散列表就只能容纳 m/100 个元素。显然这就不叫空间有效了(Space-efficient)。解决方法也简单,就是使用多个 Hash,如果它们有一个说元素不在集合中,那肯定就不在。如果它们都说在,虽然也有一定可能性它们在说谎,不过直觉上判断这种事情的概率是比较低的。
相比于其它的数据结构,布隆过滤器在空间和时间方面都有巨大的优势。布隆过滤器存储空间和插入/查询时间都是常数。另外, Hash 函数相互之间没有关系,方便由硬件并行实现。布隆过滤器不需要存储元素本身,在某些对保密要求非常严格的场合有优势。
布隆过滤器可以表示全集,其它任何数据结构都不能;
但是布隆过滤器的缺点和优点一样明显。误算率(False Positive)是其中之一。随着存入的元素数量增加,误算率随之增加。但是如果元素数量太少,则使用散列表足矣。
另外,一般情况下不能从布隆过滤器中删除元素. 我们很容易想到把位列阵变成整数数组,每插入一个元素相应的计数器加1, 这样删除元素时将计数器减掉就可以了。然而要保证安全的删除元素并非如此简单。首先我们必须保证删除的元素的确在布隆过滤器里面. 这一点单凭这个过滤器是无法保证的。另外计数器回绕也会造成问题。
在很多Key-Value系统中也使用了布隆过滤器来加快查询过程,如 Hbase,Accumulo,Leveldb,一般而言,Value 保存在磁盘中,访问磁盘需要花费大量时间,然而使用布隆过滤器可以快速判断某个Key对应的Value是否存在,因此可以避免很多不必要的磁盘IO操作,只是引入布隆过滤器会带来一定的内存消耗。
即Bloom Filter报告某一元素存在于某集合中,但是实际上该元素并不在集合中。所以当布隆过滤器报告元素存在的时候(bloomfilter_check 返回1),元素只是大概率存在,但不是一定存在。
如果某个元素确实没有在该集合中,那么Bloom Filter 是不会报告该元素存在于集合中的,所以不会漏报。当布隆过滤器报告元素不存在的时候(bloomfilter_check 返回0),元素一定不存在。
经过我的修改,只需要一个头文件:bloomfilter.h
/** * Copyright (c) 2012-2016, Jyri J. Virkki * All rights reserved. * * This file is under BSD license. See LICENSE file. */ /** * @filename bloomfilter.h * bloom filter for Linux and Windows. * * @refer * https://github.com/shadowsocks/libbloom * * @author Jyri J. Virkki, Liang Zhang <350137278@qq.com> * @version 0.0.1 * @create 2019-10-22 * @update 2019-10-22 12:12:33 * * * Sample Usage * ------------------------------------ #include "bloomfilter.h" struct bloomfilter_t bloom = { 0 }; bloomfilter_init(&bloom, 1000000, 0.01); for (i = 0; i < 200000; i++) { buflen = snprintf_chkd_V1(buffer, sizeof(buffer), "number:%d", i); bloomfilter_add(&bloom, buffer, buflen); } bloomfilter_print(&bloom); i += 5; while (i-- > 200000-10) { buflen = snprintf_chkd_V1(buffer, sizeof(buffer), "number:%d", i--); if (bloomfilter_check(&bloom, buffer, buflen)) { printf("It may be there! '%s'\n", buffer); } else { printf("It must not exist! '%s'\n", buffer); } } bloomfilter_free(&bloom); * ------------------------------------ */ #ifndef BLOOMFILTER_H_INCLUDED #define BLOOMFILTER_H_INCLUDED #ifdef __cplusplus extern "C" { #endif #include <assert.h> #include <fcntl.h> #include <math.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/stat.h> #include <sys/types.h> // #include <unistd.h> #define BLOOMFILTER_VERSION "1.0.0" /** * Structure to keep track of one bloom filter. Caller needs to * allocate this and pass it to the functions below. First call for * every struct must be to bloom_init(). * */ struct bloomfilter_t { // These fields are part of the public interface of this structure. // Client code may read these values if desired. Client code MUST NOT // modify any of these. int entries; double error; int bits; int bytes; int hashes; // Fields below are private to the implementation. These may go away or // change incompatibly at any moment. Client code MUST NOT access or rely // on these. double bpe; unsigned char * bf; int ready; }; /** * MurmurHash2.c is taken from * http://sites.google.com/site/murmurhash/ * According to the above document: * All code is released to the public domain. For business purposes, * Murmurhash is under the MIT license. * * MurmurHash2, by Austin Appleby * * Note - This code makes a few assumptions about how your machine behaves - * 1. We can read a 4-byte value from any address without crashing * 2. sizeof(int) == 4 * * And it has a few limitations - * 1. It will not work incrementally. * 2. It will not produce the same results on little-endian and big-endian machines. */ static unsigned int murmurhash2 (const void * key, int len, const unsigned int seed) { // 'm' and 'r' are mixing constants generated offline. // They're not really 'magic', they just happen to work well. const unsigned int m = 0x5bd1e995; const int r = 24; // Initialize the hash to a 'random' value unsigned int h = seed ^ len; // Mix 4 bytes at a time into the hash const unsigned char * data = (const unsigned char *)key; while (len >= 4) { unsigned int k = *(unsigned int *)data; k *= m; k ^= k >> r; k *= m; h *= m; h ^= k; data += 4; len -= 4; } // Handle the last few bytes of the input array switch (len) { case 3: h ^= data[2] << 16; case 2: h ^= data[1] << 8; case 1: h ^= data[0]; h *= m; }; // Do a few final mixes of the hash to ensure the last few // bytes are well-incorporated. h ^= h >> 13; h *= m; h ^= h >> 15; return h; } static int bloom_test_bit_set_bit (unsigned char * buf, unsigned int x, int set_bit) { unsigned int byte = x >> 3; // expensive memory access unsigned char c = buf[byte]; unsigned int mask = 1 << (x % 8); if (c & mask) { return 1; } else { if (set_bit) { buf[byte] = c | mask; } return 0; } } static int bloom_check_add (struct bloomfilter_t * bloom, const void * buffer, int len, int add) { if (bloom->ready == 0) { printf("bloom at %p not initialized!\n", (void *)bloom); return -1; } int hits = 0; register unsigned int a = murmurhash2(buffer, len, 0x9747b28c); register unsigned int b = murmurhash2(buffer, len, a); register unsigned int x; register unsigned int i; for (i = 0; i < (unsigned int) bloom->hashes; i++) { x = (a + i*b) % bloom->bits; if (bloom_test_bit_set_bit(bloom->bf, x, add)) { hits++; } } if (hits == bloom->hashes) { // 1 == element already in (or collision) return 1; } return 0; } /** * Initialize the bloom filter for use. * * The filter is initialized with a bit field and number of hash functions * according to the computations from the wikipedia entry: * http://en.wikipedia.org/wiki/Bloom_filter * * Optimal number of bits is: * bits = (entries * ln(error)) / ln(2)^2 * * Optimal number of hash functions is: * hashes = bpe * ln(2) * * Parameters: * ----------- * bloom - Pointer to an allocated struct bloomfilter_t (see above). * entries - The expected number of entries which will be inserted. * error - Probability of collision (as long as entries are not * exceeded). * * Return: * ------- * 0 - on success * 1 - on failure * */ static int bloomfilter_init (struct bloomfilter_t * bloom, int entries, double error) { bloom->ready = 0; if (entries < 1 || error == 0) { return 1; } bloom->entries = entries; bloom->error = error; double num = log(bloom->error); double denom = 0.480453013918201; // ln(2)^2 bloom->bpe = -(num / denom); double dentries = (double)entries; bloom->bits = (int)(dentries * bloom->bpe); if (bloom->bits % 8) { bloom->bytes = (bloom->bits / 8) + 1; } else { bloom->bytes = bloom->bits / 8; } bloom->hashes = (int)ceil(0.693147180559945 * bloom->bpe); // ln(2) bloom->bf = (unsigned char *)calloc(bloom->bytes, sizeof(unsigned char)); if (bloom->bf == NULL) { return 1; } bloom->ready = 1; return 0; } /** * Check if the given element is in the bloom filter. Remember this may * return false positive if a collision occured. * * Parameters: * ----------- * bloom - Pointer to an allocated struct bloomfilter_t (see above). * buffer - Pointer to buffer containing element to check. * len - Size of 'buffer'. * * Return: * ------- * 0 - element is not present * 1 - element is present (or false positive due to collision) * -1 - bloom not initialized * */ static int bloomfilter_check (struct bloomfilter_t * bloom, const void * buffer, int len) { return bloom_check_add(bloom, buffer, len, 0); } /** * Add the given element to the bloom filter. * The return code indicates if the element (or a collision) was already in, * so for the common check+add use case, no need to call check separately. * * Parameters: * ----------- * bloom - Pointer to an allocated struct bloomfilter_t (see above). * buffer - Pointer to buffer containing element to add. * len - Size of 'buffer'. * * Return: * ------- * 0 - element was not present and was added * 1 - element (or a collision) had already been added previously * -1 - bloom not initialized * */ static int bloomfilter_add (struct bloomfilter_t * bloom, const void * buffer, int len) { return bloom_check_add(bloom, buffer, len, 1); } /** * Print (to stdout) info about this bloom filter. Debugging aid. * */ static void bloomfilter_print (struct bloomfilter_t * bloom) { printf("bloom at %p\n", (void *) bloom); printf(" .entries = %d\n", bloom->entries); printf(" .error = %f\n", bloom->error); printf(" .bits = %d\n", bloom->bits); printf(" .bits-per-elem = %f\n", bloom->bpe); printf(" .bytes = %d\n", bloom->bytes); printf(" .hash-functions = %d\n", bloom->hashes); } /** * Deallocate internal storage. * * Upon return, the bloom struct is no longer usable. You may call bloom_init * again on the same struct to reinitialize it again. * * Parameters: * ----------- * bloom - Pointer to an allocated struct bloom (see above). * * Return: none * */ static void bloomfilter_free (struct bloomfilter_t * bloom) { if (bloom->ready) { free(bloom->bf); } bloom->ready = 0; } /** * Returns version string compiled into library. * * Return: version string * */ static const char * bloomfilter_version (void) { return BLOOMFILTER_VERSION; } #ifdef __cplusplus } #endif #endif
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。