class Ish::CountMinSketch

Overview

A Count-Min sketch is a probabilistic data structure used for summarizing streams of data in sub-linear space.

Performance is modelled two parameters: epsilon (error factor) and delta (probability of error).

The sketch will provide estimate counts of the true frequency at a probability of at least 1 - delta. Due to internal collissions counts provide may exceed the true frequency, but are garunteed to never underestimate the true value.

true frequency <= estimate

Error within estimates is proportional to the total aggregate number of occurences seen, and to the epsilon. This also means significantly larger truth values can dwarf smaller the error term producing more accurate estimates for items with the largest counts.

Further information and proofs can be found at http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf

OPTIMIZE support parallel hashing / increment / count operations when supported by Crystal

Defined in:

ish/count_min_sketch.cr

Constructors

Instance Method Summary

Constructor Detail

def self.new(epsilon, delta, seed = nil) #

Creates a count-min sketch with the specified characteristics.

When choosing a configuration, you are often trying to minimize the error term of the estimate. Acceptable errors in estimation fall within a range which is a factor of epsilon. Smaller values of epsilon should produce sketch configurations that provide estimates closer to the true values. However, this will increase the space required by the sketch.

Error in estimation is also proportional to the total number of distinct items counted. A factor in the error term is a summation of total number of occurences of each distinct item. Another way to reduce the error term is by counting fewer things.

Alternatively, you may try to reduce the frequency error and increase the chance of success. This is modeled as 1.0 - delta. Increasing this number can produce sketch configurations with reduced error rates, but that will require more memory.

A good rule of thumb for finding a starting configuration is to try an epsilon with as many significant digits as you have digits in your expected volume (e.g. given an expected volume of 5000 a good starting point for epsilon might be 0.0001).

An excellent resource that provides visualisation of different configurations can be found at http://crahen.github.io/algorithm/stream/count-min-sketch-point-query.html


[View source]

Instance Method Detail

def <<(item) #

Inserts item, incrementing its count by 1.


[View source]
def count(item) #

Retrieves a frequency estimate for item.


[View source]
def depth #

Gets the depth of the sketch.


[View source]
def increment(item, amount : UInt32 = 1_u32) #

Increases the count of item within the sketch by amount (default 1).


[View source]
def width #

Gets the width of the sketch.


[View source]