Target audience: Intermediate

Estimated reading time: 20'

A brief introduction to the Bloom filter and its implementation in Scala using a cryptographic digest.

## Overview

Bloom filter became a popular probabilistic data structure to enable membership queries (object x belonging to set or category Y) a couple of years ago. The main benefit of Bloom filter is to reduce the requirement of large memory allocation by avoiding allocating objects in memory much like HashSet or HashMap. The compact representation comes with a trade-off: although the filter does not allow false negatives it does not guarantee that there is no false positives.

In other words, a query returns:

- very high probability that an object belong to a set
- an object does not belong to a set

**Note**: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exception, validation of class and method arguments, scoping qualifiers or import is omitted.

## Theory

Let's consider a set A = {a0,.. an-1} of n elements for which a query to determine membership is executed. The data structure consists of a bit vector V of m bits and k completely independent hash functions that are associated to a position in the bit vector. The assignment (or mapping) of hash functions to bits has to follow a uniform distribution.

The diagram below illustrates the basic mechanism behind the Bloom filter. The set A is defined by the pair a1 and a2. The hash functions h1 and h2 map the elements to bit position (bit set to 1) in the bit vector. The element b has one of the position set to 0 and therefore does not belong to the set. The element c belongs to the set because its associated positions have bits set to 1

However, the algorithm does not prevent false positive. For instance, a bit may have been set to 1 during the insertion of previous elements and the query reports erroneously that the element belongs to the set.

The insertion of an elements depends on the h hash functions, therefore the time needed to add a new element is h (number of hash functions) and independent from size of the bit vector: asymptotic insertion time = O(h). However, the filter requires h bits for each element and is less effective that traditional bit array for small sets.

The probability of false positives decreases as the number n of inserted elements decreases and the size of the bitvector m, increases. The number of hash functions that minimizes the probability of false positives is defined by

The insertion of an elements depends on the h hash functions, therefore the time needed to add a new element is h (number of hash functions) and independent from size of the bit vector: asymptotic insertion time = O(h). However, the filter requires h bits for each element and is less effective that traditional bit array for small sets.

The probability of false positives decreases as the number n of inserted elements decreases and the size of the bitvector m, increases. The number of hash functions that minimizes the probability of false positives is defined by

**h = m.ln2/n**.## Implementation in Scala

The implementation relies on the MessageDigest

The first step is to define the BloomFilter class and its attributes

*java library class to generated the unique hash values. Ancillary methods and condition on methods arguments are ommitted for sake of clarity.*The first step is to define the BloomFilter class and its attributes

- length Number of entries in the filter (line 2)
- numHashs Number of hash functions (line 3)
- algorithm Hashing algorithm with
*SHA1*as default (line 4) - set Array of bytes for entries in the Bloom filter (line 6)
- digest Digest used to generate hash values (line 7)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 | class BloomFilter( length: Int, numHashs: Int, algorithm: String="SHA1") { val set = new Array[Byte](length) val digest = Try(MessageDigest.getInstance(algorithm)) def add(elements: Array[Any]): Int {} final def contains(el: Any): Boolean = {} private def hash(value: Int): Int {} private def getSet(el: Any): Array[Int] = {} } |

The digest using the message digest of the java library

The next step consists of defining the methods to add single generic element add(any: Any) line 8 and array of elements add(elements: Array[Any]) (line 2).

*java.security.MessageDigest*.The next step consists of defining the methods to add single generic element add(any: Any) line 8 and array of elements add(elements: Array[Any]) (line 2).

1 2 3 4 5 6 7 8 9 10 11 12 | // add an array of elements to the filter def add(elements: Array[Any]): Int = digest.map(_ => { elements.foreach( getSet(_).foreach(set(_) = 1) ) elements.size }).getOrElse(-1) @inline def add(any: Any): Boolean = this.add(Array[Any](any)) final def contains(any: Any): Boolean = digest.map( _ => !getSet(el).exists(set(_) !=1)) .getOrElse(false) |

The method contains (line 10) evaluates whether an element is contained in the filter. The method returns

- true if the filter very likely contains the element
- false if the filter DOES NOT contain this element

1 2 3 4 5 6 7 8 9 10 11 12 13 14 | def getSet(any: Any): Array[Int] = { val newSet = new Array[Int](numHashs) newSet.update(0, hash(any.hashCode)) getSet(newSet, 1) newSet } @scala.annotation.tailrec def getSet(values: Array[Int], index: Int): Unit = if( index < values.size) { values.update(index, hash(values(index-1))) getSet(values, index+1) // tail recursion } } |

Similarly to the add method, the getSet methods has two implementations

- Generate a new set from any new element (line 1)
- A recursive call to initialize the Bloom filter with an array if integers (line 9).

def hash(value: Int) : Int = digest.map(d => { d.reset d.update(value) Math.abs(new BigInteger(1, d.digest).intValue) % (set.size -1) }).getOrElse(-1)

The instance of the MessageDigestclass, digest generates a hash value using either MD5 or SHA-1 algorithm. Tail recursion is used as an alternative to the iterative process to generate the set.

The next code snippet implements a very simple implicit conversion from Int to Array[Byte] conversion (line 5)

The next code snippet implements a very simple implicit conversion from Int to Array[Byte] conversion (line 5)

1 2 3 4 5 6 7 8 9 10 | object BloomFilter { val NUM_BYTES = 4 val LAST_BYTE = NUM_BYTES -1 implicit def int2Bytes(value: Int) : Array[Byte] = Array.tabulate(NUM_BYTES)(n => { val offset = (LAST_BYTE - n) << LAST_BYTE ((value >>> offset) & 0xFF).toByte }) } |

The conversion relies on the manipulation of bits from a 32 bit Integer to 4 bytes (line 6 - 8). Alternatively, you may consider a conversion from a long value to a 8 byte array.

## Usage

This simple test consists of checking if a couple of values are indeed contains in the set. The filter will definitively reject 22 and very likely accept 23. If the objective is to confirm that 23 belongs to the set, then a full-fledged hash table would have to be used.

val filter = new BloomFilter(100, 100, "SHA") final val newValues = Array[Any](57, 97, 91, 23, 67,33) filter.add(newValues) println( filter.contains(22) ) println( filter.contains(23) )

Performance evaluation

References

Let's look at the behavior of the bloom filter under load. The test consists of adding 100,000,000 new random values then test if the filter contains a value (10,000) times. The test is run 10 times after a warm up of the JVM.

final val newValues = Array[Any](57, 97, 91, 23, 67,33) // Measure average time to add a new data set filter.add(Array.tabulate(size)(n => Random.nextInt(n + 1))) // Measure average time to test for a value. filter.contains(newValues(Random.nextInt(newValues.size)))

The first performance test evaluates the average time required to insert a new element into a Bloom filter which size range from 100M to 1Billion entries.

The second test evaluates the average search/query time for bloom filters with same range of size.

The second test evaluates the average search/query time for bloom filters with same range of size.

As expected the average time to load a new set of values and check the filter contains a specific value is fairly constant.

References

- Bloom filter Wikipedia
- github.com/prnicolas
- The Scala Programming Language - M. Odersky, L. Spoon, B.Venners - Artima 2007

Deep learning Domain is an AI function that mimics the workings of the human brain in processing data for use in detecting objects, recognizing speech, translating languages, and making decisions. IEEE Deep learning domain Deep Learning Projects for Final Year

ReplyDeletemimics the workings of the human brain in processing data for use in detecting objects, recognizing speech, translating languages, and making decisions.

Smaller than expected IEEE Final Year project centers ground for all fragments of CSE & IT engineers hoping to assemble. Final Year Projects for CSE It gives you tips and rules that is progressively critical to consider while choosing any final year project point.

Spring Framework has already made serious inroads as an integrated technology stack for building user-facing applications. Spring Framework Corporate TRaining the authors explore the idea of using Java in Big Data platforms.

Specifically, Spring Framework provides various tasks are geared around preparing data for further analysis and visualization. Spring Training in Chennai

The Angular Training covers a wide range of topics including Components, Angular Directives, Angular Services, Pipes, security fundamentals, Routing, and Angular programmability. The new Angular TRaining will lay the foundation you need to specialise in Single Page Application developer. Angular Training

I'm happy to see the considerable subtle element here!. water filter for home

ReplyDeletePositive site, where did u come up with the information on this posting? I'm pleased I discovered it though, ill be checking back soon to find out what additional posts you include. Fox Farm Nutrients

ReplyDelete

ReplyDeleteair jordankobe sneakerskyrie 6pg 1jordan 6yeezy boost 350 v2curry 6 shoesgolden goose sneakersyeezy boost 700bape