class Ish::CountMinSketch
- Ish::CountMinSketch
- Reference
- Object
Overview
A Count-Min sketch is a probabilistic data structure used for summarizing streams of data in sub-linear space.
Performance is modelled two parameters: epsilon (error factor) and delta (probability of error).
The sketch will provide estimate counts of the true frequency at a
probability of at least 1 - delta
. Due to internal collissions counts
provide may exceed the true frequency, but are garunteed to never
underestimate the true value.
true frequency <= estimate
Error within estimates is proportional to the total aggregate number of occurences seen, and to the epsilon. This also means significantly larger truth values can dwarf smaller the error term producing more accurate estimates for items with the largest counts.
Further information and proofs can be found at http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf
OPTIMIZE support parallel hashing / increment / count operations when supported by Crystal
Defined in:
ish/count_min_sketch.crConstructors
-
.new(epsilon, delta, seed = nil)
Creates a count-min sketch with the specified characteristics.
Instance Method Summary
-
#<<(item)
Inserts item, incrementing its count by 1.
-
#count(item)
Retrieves a frequency estimate for item.
-
#depth
Gets the depth of the sketch.
-
#increment(item, amount : UInt32 = 1_u32)
Increases the count of item within the sketch by amount (default 1).
-
#width
Gets the width of the sketch.
Constructor Detail
Creates a count-min sketch with the specified characteristics.
When choosing a configuration, you are often trying to minimize the error term of the estimate. Acceptable errors in estimation fall within a range which is a factor of epsilon. Smaller values of epsilon should produce sketch configurations that provide estimates closer to the true values. However, this will increase the space required by the sketch.
Error in estimation is also proportional to the total number of distinct items counted. A factor in the error term is a summation of total number of occurences of each distinct item. Another way to reduce the error term is by counting fewer things.
Alternatively, you may try to reduce the frequency error and increase the
chance of success. This is modeled as 1.0 - delta
. Increasing this number
can produce sketch configurations with reduced error rates, but that will
require more memory.
A good rule of thumb for finding a starting configuration is to try an
epsilon with as many significant digits as you have digits in your expected
volume (e.g. given an expected volume of 5000 a good starting point for
epsilon might be 0.0001
).
An excellent resource that provides visualisation of different configurations can be found at http://crahen.github.io/algorithm/stream/count-min-sketch-point-query.html
Instance Method Detail
Increases the count of item within the sketch by amount (default 1).