Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.
The "DataSketchesPorposal" page has been changed by chenliang613:
https://wiki.apache.org/incubator/DataSketchesPorposal?action=diff&rev1=3&rev2=4
= Apache DataSketches Proposal =
== Abstract ==
DataSketches is an open source, high-performance library of stochastic streaming algorithms commonly called "sketches" in the data sciences. Sketches are small, stateful programs that process massive data as a stream and can provide approximate answers, with mathematical guarantees, to computationally difficult queries orders of magnitude faster than traditional, exact methods.
This proposal is to move DataSketches.github.io into the Apache Software Foundation (ASF) incubation process, transferring ownership of its copyright intellectual property to the ASF. Thereafter, it would be officially known as Apache DataSketches, and its evolution and governance would come under the rules and guidance of the ASF.
== Introduction ==
The DataSketches library contains carefully crafted implementations of sketch algorithms that meet rigorous standards of quality and performance and provide capabilities required for large-scale production systems that must process and analyze massive data. The DataSketches core repository is written in Java, with a parallel core repository written in C++ that includes Python wrappers. The DataSketches library also includes special repositories that extend the core library for Apache Hive and Apache Pig. The sketches developed in the different languages share a common binary storage format so that sketches created and stored in Java, for example, can be fully used in C++, and vice versa. Because the stored sketch "images" are just a "blob" of bytes (similar to picture images), they can be shared across many different systems, languages, and platforms.
The DataSketches website includes general tutorials, a comprehensive research section with references to relevant academic papers, extensive examples for using the core library directly, as well as examples for accessing the library in Hive, Pig, and Apache Spark.

The DataSketches library also includes a characterization repository for long-running test programs that are used for studying the accuracy and performance of these sketches over wide ranges of input variables. The data produced by these programs is used for generating the many performance plots contained in the documentation website and for academic publications.

The code repositories used for production are versioned and published to Maven Central at periodic intervals as the library evolves.
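For illustration, a project could pull the core library into a Maven build with a dependency such as the following. The coordinates and version shown are assumptions based on the artifacts published around the time of this proposal; the current values should be verified on Maven Central:

```xml
<!-- Illustrative only: verify groupId/artifactId/version on Maven Central -->
<dependency>
  <groupId>com.yahoo.datasketches</groupId>
  <artifactId>sketches-core</artifactId>
  <version>0.12.0</version>
</dependency>
```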
It is important to note that our internal systems at Yahoo use the current public GitHub
open source DataSketches library and not an internal version of the code.
The close collaboration of scientific research and engineering development experience with actual massive-data processing systems has also produced new research publications in the field of stochastic streaming algorithms, for example:
 * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee Rhodes, and Justin Thaler. A high-performance algorithm for identifying frequent items in data streams. In ACM IMC, 2017.
 * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A framework for estimating stream expression cardinalities. In EDBT/ICDT Proceedings '16, pages 6:1–6:17, 2016.
 * Mina Ghashami, Edo Liberty, and Jeff M. Phillips. Efficient Frequent Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings '16, pages 845–854, 2016.
 * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile approximation in streams. In IEEE FOCS Proceedings '16, pages 71–78, 2016.
 * Kevin J. Lang. Back to the future: an even more nearly optimal cardinality estimation algorithm. arXiv preprint https://arxiv.org/abs/1708.06839, 2017.
 * Edo Liberty. Simple and deterministic matrix sketching. In ACM KDD Proceedings '13, pages 581–588, 2013.
 * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan Ullman. Space lower bounds for itemset frequency sketches. In ACM PODS Proceedings '16, pages 441–454, 2016.
 * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler. Hierarchical heavy hitters with the space saving algorithm. In SIAM ALENEX Proceedings '12, pages 160–174, 2012.
== The Rationale for Sketches ==
In the analysis of big data there are often problem queries that don't scale because they require huge compute resources and time to generate exact results. Examples include count distinct, quantiles, most frequent items, joins, matrix computations, and graph analysis.

If we can loosen the requirement of "exact" results from our queries and be satisfied with approximate results, within some well-understood bounds of error, there is an entire branch of mathematics and data science that has evolved around developing algorithms that can produce approximate results with mathematically well-defined error properties.

Adding the requirements that these algorithms must be small (compared to the size of the input data), sublinear (the size of the sketch must grow at a slower rate than the size of the input stream), streaming (they can only touch each data item once), and mergeable (suitable for distributed processing) defines a class of algorithms that can be described as small, stochastic, streaming, sublinear, mergeable algorithms, commonly called sketches (they also have other names, but we will use the term sketches from here on).
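To make this class of algorithms concrete, here is a minimal, self-contained toy example in Java, the language of the core repository: a K-Minimum-Values (KMV) distinct-count sketch. This is a simplified teaching sketch under our own class and method names, not the DataSketches implementation, and the hash function is a deliberately crude stand-in.

```java
import java.util.TreeSet;

// Toy K-Minimum-Values (KMV) distinct-count sketch, illustrating the
// small, streaming, sublinear, and mergeable properties described above.
// This is an illustrative sketch, NOT the DataSketches implementation.
public final class KmvSketch {
    private final int k;                                  // fixed size bound
    private final TreeSet<Double> minHashes = new TreeSet<>();

    public KmvSketch(int k) { this.k = k; }

    // Streaming: each item is touched exactly once, in O(log k) time.
    public void update(String item) {
        double h = normalizedHash(item);
        if (minHashes.size() < k) {
            minHashes.add(h);                             // duplicates collapse
        } else if (h < minHashes.last() && !minHashes.contains(h)) {
            minHashes.add(h);
            minHashes.pollLast();                         // keep only the k smallest
        }
    }

    // Estimate of the number of distinct items seen so far.
    public double estimate() {
        if (minHashes.size() < k) return minHashes.size(); // exact while small
        return (k - 1) / minHashes.last();                 // classic KMV estimator
    }

    // Mergeability: union the retained hashes and keep the k smallest, so
    // merge(sketch(A), sketch(B)) approximates sketch(A + B).
    public static KmvSketch merge(KmvSketch a, KmvSketch b) {
        KmvSketch m = new KmvSketch(Math.min(a.k, b.k));
        TreeSet<Double> all = new TreeSet<>(a.minHashes);
        all.addAll(b.minHashes);
        for (double h : all) {
            if (m.minHashes.size() < m.k) m.minHashes.add(h);
            else break;
        }
        return m;
    }

    // Map an item to a pseudo-uniform value in [0, 1) -- a crude stand-in
    // for the high-quality hash a real sketch would use.
    private static double normalizedHash(String item) {
        long h = item.hashCode() * 0x9E3779B97F4A7C15L;
        return (double) (h >>> 11) / (1L << 53);
    }
}
```

Note how the retained state is at most k hash values regardless of stream length (small, sublinear), each item is processed once (streaming), and two sketches combine into a sketch of the concatenated streams without reprocessing the data (mergeable).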

To qualify as a sketching algorithm, we believe it must exhibit the following major properties:
 * Streaming. To be truly streaming, a sketch can only "touch" each item of the stream once. There is no second chance.
 * Constant amortized per-item processing time. Sketch algorithms are designed so that the processing time per item is essentially independent of n, the number of items (or size) of the input stream. There may be specific instances where, upon receiving a new item, the sketch needs to resize or rebuild its internal data structures to accommodate more items, but overall these events are rare, so that when amortized over n, the computational cost is effectively O(1) with a small hidden constant.
 * Small size. Relative to the size of the input stream, sketches retain a small amount of information about the state of the stream observed so far. Sketch sizes are often orders of magnitude smaller, typically just kilobytes in size, even for very long input streams. The smaller the retained information in the sketch, the faster it can be examined or processed.
 * Sublinear space growth. The sublinear property means that as the number of items in the input stream, n, grows, the size of the sketch must not grow proportionately; it must grow much less than that. To be useful, sketch algorithms should grow logarithmically or sub-logarithmically with the size of the input stream, or not at all. Not only must the sketch start small, it must remain small as n increases.
 * Mergeability. In order to be useful in large distributed systems, sketches must be mergeable. This means that the internal data structures of two sketches can be combined in a linear or "additive" fashion. Let A and B be streams and + indicate concatenation: sketch.merge(sketch(A), sketch(B)) ≅ sketch(A + B). Some sketches extend mergeability to include operations other than concatenation or union, such as set intersection and set difference. Mergeability enables input streams to be automatically processed in a distributed and parallel manner: split the stream up arbitrarily across many machines, process each piece separately, and merge the resulting sketches.
 * Mathematically proven error properties. It is important that results obtained from a sketch have clearly defined, useful error properties. For example, a sketch with a specific configuration might specify that with 95% confidence, the exact answer will be within plus-or-minus 1% of the estimated result produced by the sketch. In order to make such a statement, the sketch must be essentially stream-independent, which is to say that the sketch must deliver on its claimed error properties independent of the length, the order, the range of values, and the distribution of the items in the input stream. There is always the possibility that with certain stream lengths, orders, or distributions, the actual sketch error might be considerably better, even zero, but the claimed error properties still hold.
 In order to make claims like this, the theory behind sketches must include rigorous mathematical
proofs that demonstrate that, if implemented correctly, the sketch will produce results with
the claimed error properties regardless of the input stream.
The sketch is essentially a complex state machine, which, combined with some arbitrary input stream, defines a stochastic process. We then apply probabilistic methods to interpret the states of the stochastic process in order to extract useful information about the input stream itself. The resulting information will be approximate, but we also use additional probabilistic methods to extract an estimate of the likely probability distribution of the error.
The error of the sketch result for a given input stream is a random variable, and "mathematically proven error properties" means that this random variable has a probability distribution with a mean and a variance that are well understood. More generally, the estimate returned by a sketch will be within some error ε of the true answer to any query, with a specified statistical confidence. The definition of this ε error is determined by the mathematics underlying the specific sketch and can be either additive (i.e., the sketch returns the correct result in the range result ± ε) or multiplicative (i.e., the sketch returns the correct result in the range result · (1 ± ε)).
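As a hedged numeric illustration of a multiplicative bound: a rule of thumb often quoted for Theta-style distinct-count sketches is a relative standard error of roughly 1/√k for k retained entries, giving a 95% interval of roughly result · (1 ± 2/√k). The exact constants depend on the sketch family and configuration, so treat the numbers below as illustrative only:

```java
// Illustrates a multiplicative error bound of the form result · (1 ± ε),
// using the 1/sqrt(k) relative-standard-error rule of thumb commonly quoted
// for Theta-style distinct-count sketches. Numbers are illustrative only.
public final class ErrorBounds {
    // Approximate relative standard error for k retained entries.
    public static double rse(int k) {
        return 1.0 / Math.sqrt(k);
    }

    // Approximate 95% confidence interval: result · (1 ± 2·RSE).
    public static double[] bounds95(double estimate, int k) {
        double eps = 2.0 * rse(k);
        return new double[] { estimate * (1.0 - eps), estimate * (1.0 + eps) };
    }
}
```

For example, with k = 4096 the RSE is 1/64 ≈ 1.56%, so an estimate of 1,000,000 distinct items carries an approximate 95% interval of [968,750, 1,031,250].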
 The mathematical proofs should also include the merge process and associated algorithms.
It is important to note that there must be no error penalty for merging. I.e., the error properties
of the merged sketch must be no worse than the error of a single sketch produced directly
from the concatenated streams.

Because sketches are small, they can be processed extremely fast, often many orders of magnitude faster than traditional exact computations. For interactive queries there may not be other viable alternatives, and in the case of real-time analysis, sketches are the only known solution.

For any system that needs to extract useful information from massive data, sketches are essential tools that should be tightly integrated into the system's analysis capabilities. This technology has helped Yahoo successfully reduce data processing times from days to hours or minutes on a number of its internal platforms and has enabled sub-second queries on real-time platforms that would have been infeasible without sketches.

== Sketch Algorithms and Science ==
There is a significant scientific contribution here in defining the state machine, understanding the resulting stochastic process, developing the probabilistic methods, and proving mathematically that it all works! This is why the scientific contributors to this project are a critical and strategic component of our success. The development engineers translate the concepts of the proposed state machine and probabilistic methods into production-quality code. Even more important, they work closely with the scientists, feeding back system and user requirements, which leads not only to superior product design, but to new science as well. A number of the scientific papers our members have published (see above) are a direct result of this close collaboration.

Sketching (a.k.a. stochastic, streaming, sublinear, mergeable algorithms) is really a distinct area of computer science / computational statistics. It is the science behind sketches that distinguishes sketch algorithms from empirical algorithms, which tend to be data-sensitive and cannot provide a priori (or even a posteriori) bounds on error.

Sketching is a rather distinct field in that the underlying theory of these algorithms is usually taught only in graduate-level elective courses, and only at the few universities that have professors with experience in this field. For example, a freshly minted Ph.D. in machine learning does not imply any exposure to this area. Even coursework in algorithms does not imply that this area is taught, because it is an advanced topic. There are specific academic workshops on streaming sublinear algorithms, but no dedicated conferences yet.
== The Rationale for Apache DataSketches ==
Other open source implementations of sketch algorithms can be found on the Internet. However, we have not yet found any open source implementations that are as comprehensive, engineered with the quality required for production systems, and that provide usable and guaranteed error properties. Large Internet companies, such as Google and Facebook, have published papers on sketching; however, their implementations of their published algorithms are proprietary and not available as open source.

To unsubscribe, email: cvs-unsubscribe@incubator.apache.org
For additional commands, email: cvs-help@incubator.apache.org
