incubator-cvs mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <>
Subject [Incubator Wiki] Update of "DataSketchesProposal" by Lee Rhodes
Date Wed, 06 Mar 2019 19:40:22 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "DataSketchesProposal" page has been changed by Lee Rhodes:

   * Small Size. Relative to the size of the input stream, sketches retain a small amount
of information about the state of the stream observed so far. Sketch sizes are often orders-of-magnitude
smaller, typically just kilobytes in size, even for very long input streams. The smaller the
retained information in the sketch the faster it can be examined or processed.
   * Sublinear space growth. The sublinear property means that as the number of items in the
input stream, n, grows, the size of the sketch must not grow proportionately, it must grow
much less than that. To be useful, sketch algorithms should grow logarithmically or sub-logarithmically
with the size of the input stream, or not at all. Not only must the sketch start small, it
must remain small as n increases.
   * Mergeability. In order to be useful in large distributed systems, sketches must be mergeable.
This means that the internal data structures of two sketches can be combined in a linear or
“additive” fashion. Let A and B be streams and + indicate concatenation:
- .. sketch.merge(sketch(A), sketch(B)) ≅ sketch(A + B).
+   . sketch.merge(sketch(A), sketch(B)) ≅ sketch(A + B).
   . Some sketches extend mergeability to include operations other than concatenation or union,
such as set intersection and set difference. Mergeability enables input streams to be automatically
processed in a distributed and parallel manner: split the stream up arbitrarily across many
machines, process each piece separately, and merge the resulting sketches.
   * Mathematically proven error properties. It is important that results obtained from a
sketch have clearly defined useful error properties. For example, a sketch with a specific
configuration might specify that with 95% confidence, the exact answer will be within plus-or-minus
1% of the estimated result produced by the sketch. In order to make such a statement the sketch
must essentially be stream independent, which is to say that the 

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message