 ### C++ Program to Implement CountMinSketch

Description:

This program is to implement Count-Min Sketch. Count-Min Sketch is a probabilistic data structure. This program can be used to summarize a data in many different ways.

Explanation:

This is used to record a numerical value associated with each element. For example, the number of occurrences of the element in a stream.

Algorithm:

1.      N different hash functions and N different lists all of which are indexed by the outputs of these hash functions are created.

2.      Initially, all of the values in the lists are set to Zero.

3.      When we go through the stream we increase the count of an element, and then increment all corresponding N fields in all the different lists (The indexes which are given by the hash values of the element.)

4.      Then the final step to find the count of the element, we take the minimum of  the N fields that correspond to the element (The fields are decided by the values given by the hash function.)

Detailed Algorithm:

Different functions are used to implement different tasks. The functions are INITIALIZE, UPDATE and  ESTIMATE. The pseudo code for these functions are as follows:

INITIALIZE:

This function id used the initialize the array C of w x d counters to Zero, and picks values for the hash function based on the prime p.

Algorithm: INITIALIZE( w,d,p)

1.       C[1,1]....C[d,w] <---0

2.      for j <--- 1 to d

3.              Pick aj, bj uniformly from [1....p]

4.             N=0

UPDATE(i,c):

Updates the total count N with c, and the loop hashes i to its counter in each row, and updates the counter there.

Algorithm: UPDATE(i,c)

1.      N=N+c

2.      for j<--- 1 to d

3.              hj(x)=(aj x i + bj ) mod w

4.              C[j,hj(i)]=C[j,hj(i)] + c

ESTIMATE(i):

It is identical to this loop: given i, perform some hashing on i keeping track of the value of the counter in each row, then return the minimum counter. This is the smallest value of  C[j,hj(i)] over the d values of j.

Algorithm:ESTIMATE(i)

1.      e= infinity

2.      for j<---1 to d

3.           hj(x)=(aj x i + bj ) mod w

4.           e=min(e,C[j,hj(i)])

5.      return e

Program

Program of the Header file of Count-Min Sketch function

// some constants
# define LONG_PRIME 4294967311l
# define MIN(a,b) (a < b ? a : b)

/** Definition of CountMinSketch class **/
class CountMinSketch {

// width, depth

unsigned int w,d;

// eps (for error), 0.01 < eps < 1
// the smaller the better
float eps;

// gamma (probability for accuracy), 0 < gamma < 1
// the bigger the better
float gamma;

// aj, bj in Z_p
// both elements of fild Z_p used in generation of hash
// function
unsigned int aj, bj;

// total count so far
unsigned int total;

// array of arrays of counters
int **C;

// array of hash values for a particular item
// contains two element arrays {aj,bj}
int **hashes;

// generate "new" aj,bj
void gen(int **hashes, int i);

public:

// constructor
CountMinSketch(
float eps, float gamma);

// update item (int) by count c
void update(int item, int c);

// update item (string) by count c
void update(const char *item, int c);

// estimate count of item i and return count
unsigned int estimate(int item);

unsigned int estimate(const char *item);

unsigned int totalcount();

// generates a hash value for a string
// same as djb2 hash function
unsigned int hashstr(const char *str);

// destructor
~CountMinSketch();

};

Test program to run the code

# include <bits/stdc++.h> //This header includes all the libraries.
# include "count_min_sketch.hpp" // This is used for the count-min sketch

using namespace std;

int main() {
// Count for ar_str[i] is i1+i2+...
// where i's are the positions where ar_str[i] occurs
const char *ar_str[] = {
"ayush", "done", "kind", "ayush", "alice",
"one", "singhal", "let", "we", "singhal",
"alice", "in", "wonderland", "we", "singhal",
"singhal", "done", "ayush", "fun", "tie"
};

CountMinSketch c(0.01, 0.1);
unsigned int i, total = 0;
map<const char *, int> mapitems;
map<const char *, int>::const_iterator it;

for (i = 0; i < 15; i++) {
if ((it = mapitems.find(ar_str[i]))!=mapitems.end()) {
mapitems[ar_str[i]] += i;
}
else {
mapitems[ar_str[i]] = i;
}
c.update(ar_str[i], i);
total += i;
}

// 1. test for items in ar_str
// note: since probablistic (and not deterministic)
// might not be accurate sometimes
for (it = mapitems.begin(); it != mapitems.end(); it++) {
if (c.estimate(it->first) != mapitems[it->first]) {
cout << "Incorrect count for " << it->first <<";error: " << abs(int(c.estimate(it->first)-mapitems[it->first]))<< endl;
}
else {
cout << "Correct count for '" << it->first <<"' ;count: " << c.estimate(it->first) << endl;
}
}
cout << "c.totalcount()==total? "<< (c.totalcount() == total ? "True" : "False") << "Sketch Total: " << c.totalcount() << endl;

// 2. test for items not in ar_str
cout << "Testing for strings not in ar_str..." << endl;
cout << c.estimate("ajay") << endl;
cout << c.estimate("programming") << endl;

// 3. test for items in ar_str
cout << "Testing for strings in ar_str..." << endl;
cout << c.estimate("ayush") << endl;
cout << c.estimate("singhal") << endl;
return 0;
}

Output :

0

0

3

4

(This is one of the output. The output depends on the Hash function. The output maybe more than these values but not less than the given value.)

This function uses sub linear space. 