incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "DataSketchesPorposal" by chenliang613
Date Sat, 02 Mar 2019 02:49:26 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "DataSketchesPorposal" page has been changed by chenliang613:
https://wiki.apache.org/incubator/DataSketchesPorposal?action=diff&rev1=7&rev2=8

  
  We want to encourage more synergy with other data processing platforms. In addition to the
fundamental capabilities of sketching mentioned above, the DataSketches library provides some
additional capabilities specifically designed for large data platforms.
   * Binary compatibility across language, platform, and history.  Binary compatibility means
that the stored image of a sketch can be fully interpreted and used by the same type of sketch
in a different language (e.g., C++, Python) or on a different platform.  Our guarantee is
that sketches produced by the earliest versions of our code can still be read and
interpreted by the latest versions of our code.  This is critically important for systems
that might store years' worth of sketches, because storing sketches is vastly more efficient
than attempting to store years' worth of raw data.  We have found that this property is even
more important than backward compatibility of the APIs.  Unfortunately, APIs do have to change
and evolve, and while we try hard to avoid this, it is sometimes required.
-  * * Accommodations for specific system architecture or language requirements.   Through
our work with the Druid team we learned the importance of being able to operate sketches off
the java heap.  As a result, the sketches that we have currently integrated into Druid’s
aggregation functions have this off-heap (or Direct) capability.  By operate we mean that
the sketch is able to be updated, queried, and merged without having to be deserialized on
to the Java heap first.  Our work with PostgreSQL (C++) team has taught us the importance
of enabling user specification of malloc() and free() which can be customized to the environment.
 
+  * Accommodations for specific system architecture or language requirements.   Through our
work with the Druid team we learned the importance of being able to operate sketches off the
Java heap.  As a result, the sketches that we have currently integrated into Druid’s aggregation
functions have this off-heap (or Direct) capability.  By "operate" we mean that the sketch
can be updated, queried, and merged without first having to be deserialized onto the Java
heap.  Our work with the PostgreSQL (C++) team has taught us the importance of letting users
specify malloc() and free() implementations that can be customized to the environment.  
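
The binary-compatibility guarantee in the first bullet above can be illustrated with a small sketch. This is not the DataSketches serialization format: the byte layouts, the version numbers, and the function names below are all hypothetical. The point is only that a stored image starts with a version byte, uses a fixed byte order so that it reads the same on any platform, and that the newest reader keeps a code path for every older layout.

```python
import struct

# Hypothetical layouts (explicit little-endian, so they are platform independent):
#   v1: version (uint8), k (uint16), then 64-bit hashes until the end of the image
#   v2: version (uint8), k (uint16), count (uint32), then `count` 64-bit hashes


def serialize_v1(k, hashes):
    """Legacy writer, kept here only to produce an old image for the demo."""
    return struct.pack("<BH", 1, k) + struct.pack(f"<{len(hashes)}Q", *hashes)


def serialize_v2(k, hashes):
    """Current writer: same content, plus an explicit item count."""
    return struct.pack("<BHI", 2, k, len(hashes)) + struct.pack(f"<{len(hashes)}Q", *hashes)


def deserialize(image):
    """Current reader: understands every layout that was ever written."""
    version = image[0]
    if version == 1:
        (k,) = struct.unpack_from("<H", image, 1)
        offset = 3
        count = (len(image) - offset) // 8  # v1 had no count field
    elif version == 2:
        k, count = struct.unpack_from("<HI", image, 1)
        offset = 7
    else:
        raise ValueError(f"unknown serial version {version}")
    hashes = list(struct.unpack_from(f"<{count}Q", image, offset))
    return k, hashes


old_image = serialize_v1(16, [11, 22, 33])
new_image = serialize_v2(16, [11, 22, 33])
print(deserialize(old_image) == deserialize(new_image))
```

Because the byte order is spelled out in the format strings, a reader written in another language could decode the same images by following the same layout table.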
+ 
+ We believe that having DataSketches as an Apache project will provide an immediate, worthwhile,
and substantial contribution to the open source community, and will give the project a better
opportunity to contribute meaningfully to both the science and engineering of sketching algorithms
and to integrate with other Apache projects.  In addition, this is a significant opportunity
for Apache to be the "go-to" destination for users who want to leverage this exciting technology.
  
  == Apache DataSketches as a Top-Level Project ==
  Because successful development and implementation of high-performance sketches involves
knowledge of advanced mathematics and statistics, there might be a tendency to associate the
Apache DataSketches project with Apache Commons-Math or Apache Commons-Statistics.  This,
I believe, would be a mistake for several reasons.
- 
- Language Support. The Apache Commons-Math, Apache Commons-Statistics, and Apache Commons-Lang
libraries are exclusively Java libraries by definition.  The DataSketches library supports
multiple languages (So far: Java, C++, Python). 
+  * Language Support. The Apache Commons-Math, Apache Commons-Statistics, and Apache Commons-Lang
libraries are exclusively Java libraries by definition.  The DataSketches library supports
multiple languages (so far: Java, C++, and Python). 
- Visibility to data processing platform developers.  Sketching is a relatively new field
in the arsenal of tools available to system developers.  Burying this project under the commons
math or commons statistics may make it harder to find. We want to encourage synergy with the
various platforms to learn to leverage this technology and to provide feedback to us on capabilities
in the design of the sketches themselves.
+  * Visibility to data processing platform developers.  Sketching is a relatively new field
in the arsenal of tools available to system developers.  Burying this project under Commons-Math
or Commons-Statistics may make it harder to find. We want to encourage synergy with the
various platforms so that they learn to leverage this technology and give us feedback on the
capabilities needed in the design of the sketches themselves.
- Sketches solve difficult computational problems that are desirable queries in large data
processing systems, such as unique counts, quantiles, CDFs, PMFs, Histograms, Heavy-hitters
(TopN), etc.  And they solve these problems in a mergeable and streaming way, which makes
them suitable for real-time queries.
+  * Sketches solve difficult computational problems that are desirable queries in large data
processing systems, such as unique counts, quantiles, CDFs, PMFs, Histograms, Heavy-hitters
(TopN), etc.  And they solve these problems in a mergeable and streaming way, which makes
them suitable for real-time queries.
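
To make "mergeable and streaming" concrete, here is a minimal sketch of a heavy-hitters summary. It uses the textbook Misra-Gries algorithm as a stand-in and is not the DataSketches implementation; the class and method names are hypothetical. Each partition of the data is summarized independently in one pass, and the summaries are then merged without revisiting the raw items.

```python
from collections import Counter


class MisraGries:
    """Textbook frequent-items summary holding at most k counters (illustration only)."""

    def __init__(self, k):
        self.k = k
        self.n = 0                 # total number of items seen
        self.counters = Counter()

    def update(self, item):
        self.n += 1
        if item in self.counters or len(self.counters) < self.k:
            self.counters[item] += 1
        else:
            # Table is full: decrement every counter and drop those that reach zero.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def merge(self, other):
        """Return a new summary covering both input streams."""
        merged = MisraGries(self.k)
        merged.n = self.n + other.n
        merged.counters = self.counters + other.counters
        if len(merged.counters) > self.k:
            # Subtract the (k+1)-th largest count so that at most k counters survive.
            cut = sorted(merged.counters.values(), reverse=True)[self.k]
            merged.counters = Counter(
                {key: c - cut for key, c in merged.counters.items() if c > cut}
            )
        return merged

    def estimate(self, item):
        """Lower bound on the item's count; undercounts by at most n / (k + 1)."""
        return self.counters.get(item, 0)


part1 = ["a"] * 60 + [f"x{i}" for i in range(40)]
part2 = ["a"] * 30 + ["b"] * 50 + [f"y{i}" for i in range(20)]

s1, s2 = MisraGries(k=4), MisraGries(k=4)
for item in part1:
    s1.update(item)
for item in part2:
    s2.update(item)

merged = s1.merge(s2)
slack = merged.n / (merged.k + 1)
for item, true_count in (("a", 90), ("b", 50)):
    print(item, "estimate:", merged.estimate(item), "true:", true_count, "max undercount:", slack)
```

The merged summary uses the same bounded memory as each input summary, which is what allows partial results to be combined across partitions or time windows at query time.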
  
  == Initial Goals ==
  We are breaking our initial goals into short-term (2-6 months) and intermediate to longer-term
(6 months to 2 years):
  
  Our short-term goals include:
  
- Understanding and adapting to the Apache development process and structures.
+  * Understanding and adapting to the Apache development process and structures.
- Start refactoring codebase and move various DataSketches repositories code to Apache Git
repository.
+  * Refactoring the codebase and moving the code of the various DataSketches repositories
to the Apache Git repository.
- Continue development of new features, functions, and fixes.
+  * Continuing development of new features, functions, and fixes.
- Specific sub-projects (e.g., C++ and Python) will continue to be developed and expanded.
+  * Specific sub-projects (e.g., C++ and Python) will continue to be developed and expanded.
  
  Our intermediate to longer-term goals include:
  
- Completing the design and implementation of the C++ sketches to complement what is already
available in Java, and the Python wrappers of those C++ sketches.
+  * Completing the design and implementation of the C++ sketches to complement what is already
available in Java, and the Python wrappers of those C++ sketches.
- Expanding the C++ build framework to include Windows and the popular Linux variants.
+  * Expanding the C++ build framework to include Windows and the popular Linux variants.
- Continued engagement with the scientific research community on the development of new algorithms
for computationally difficult problems that heretofore have not had a sketching solution.
+  * Continued engagement with the scientific research community on the development of new
algorithms for computationally difficult problems that heretofore have not had a sketching
solution.
  
  == Current Status ==
  The DataSketches GitHub project has been quite successful.  As of this writing (February 2019),
the number of downloads measured by the Nexus Repository Manager at https://oss.sonatype.org
has grown by nearly a factor of 10 over the past year to about 55 thousand per month. The
DataSketches/sketches-core repository has about 560 stars and 141 forks, which is pretty good
for a highly specialized library.
  
  == Development Practices ==
  === Source Control ===
- All of our developers have extensive experience with Git version control and follow accepted
practices for use of Pull Requests (PRs), code reviews and commits to master, for example.
 
+ All of our developers have extensive experience with Git version control and follow accepted
practices for pull requests (PRs), code reviews, and commits to master.

- 
- ---- /!\ '''Edit conflict - other version:''' ----
  
  === Testing === 
  Sketches, by their nature, are probabilistic programs and don’t necessarily behave deterministically.
 For some of the sketches we intentionally insert random noise into the code, as this gives
us the mathematical properties that we need to guarantee accuracy.  This can make the behavior
of these algorithms quite unintuitive and poses significant challenges to the developer
who wishes to test these algorithms for correctness. As a result, our testing strategy includes
two major components: unit tests and characterization tests.  
@@ -120, +119 @@

  === Unit Testing ===
  Our unit tests are primarily quick tests to make sure that we exercise all critical paths
in the code and that key branches are executed correctly. It is important that they execute
relatively fast, as they are generally run on every code build. The sketches-core repository
alone has about 22 thousand statements, over 1300 unit tests, and code coverage of about 98.2%
as measured by Atlassian/Clover.  Our goal is for every code repository that is used in
production to have code coverage greater than 90%.
  
- 
- ---- /!\ '''Edit conflict - your version:''' ----
- 
- === Testing === 
- Sketches, by their nature are probabilistic programs and don’t necessarily behave deterministically.
 For some of the sketches we intentionally insert random noise into the code as this gives
us the mathematical properties that we need to guarantee accuracy.  This can make the behavior
of these algorithms quite unintuitive and provides significant challenges to the developer
who wishes to test these algorithms for correctness. As a result, our testing strategy includes
two major components: unit tests, and characterization tests.  
- 
- === Unit Testing ===
- Our unit tests are primarily quick tests to make sure that we exercise all critical paths
in the code and that key branches are executed correctly. It is important that they execute
relatively fast as they are generally run on every code build. The sketches-core repository
alone has about 22 thousand statements, over 1300 unit tests and code coverage of about 98.2%
as measured by Atlassian/Clover.  It is our goal for all of our code repositories that are
used in production that they have code coverage greater than 90%.
- 
- 
- ---- /!\ '''End of edit conflict''' ----
  === Characterization Testing ===
  In order to test the probabilistic methods that are used to interpret the stochastic behaviors
of our sketches, we have a separate characterization repository dedicated to this purpose.
 Measuring accuracy, for example, requires running thousands of trials at each of many different
points along the domain axis. Each trial compares its estimated result against a known exact
result, producing an error for that trial.  These error measurements are then fed into our
Quantiles sketch to capture the actual distribution of error at that point along the axis.
We then select quantile contours across all the distributions at points along the axis.  These
contours can then be plotted to reveal the shape of the actual error distribution. These distributions
are not at all Gaussian; in fact, they can be quite complex.  Nonetheless, these distributions
are then checked against the statistical guarantees inherent to the specific sketch algorithm
and its parameters. There are many examples of these characterization error distributions
on our website. The runtimes of these tests can be very long: they range from many minutes
to hours, and some can run for days.  Currently, we have separate characterization repositories
for Java and C++ / Python.   
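
The characterization procedure described above can be sketched in a few lines. This is a scaled-down illustration under stated assumptions, not the project's characterization code: `KmvSketch` is a hypothetical k-minimum-values distinct-count sketch standing in for a real sketch, the trial counts are tiny, and exact sorting of the per-trial errors replaces the Quantiles sketch that the text describes.

```python
import hashlib


def hash01(item):
    """Map an item to a pseudo-random float in (0, 1]."""
    digest = hashlib.blake2b(repr(item).encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / 2**64


class KmvSketch:
    """Hypothetical k-minimum-values distinct-count sketch (illustration only)."""

    def __init__(self, k):
        self.k = k
        self.mins = set()          # the k smallest hash values seen so far
        self.theta = float("inf")  # hashes at or above theta cannot enter

    def update(self, item):
        h = hash01(item)
        if h >= self.theta or h in self.mins:
            return
        self.mins.add(h)
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))
            self.theta = max(self.mins)

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # exact while under capacity
        return (self.k - 1) / max(self.mins)


def characterize(k, stream_lengths, trials, quantile_ranks):
    """Return {n: relative-error quantiles} for each point n on the domain axis."""
    contours = {}
    for n in stream_lengths:
        errors = []
        for t in range(trials):
            sketch = KmvSketch(k)
            for i in range(n):
                sketch.update((t, n, i))  # n distinct items, different in every trial
            errors.append(sketch.estimate() / n - 1.0)  # the exact answer is n
        errors.sort()  # stand-in for feeding the errors into a Quantiles sketch
        contours[n] = [errors[min(int(r * trials), trials - 1)] for r in quantile_ranks]
    return contours


contours = characterize(k=64, stream_lengths=[32, 256, 2048], trials=100,
                        quantile_ranks=[0.05, 0.5, 0.95])
for n, quantiles in contours.items():
    print(n, [round(q, 3) for q in quantiles])
```

Each printed row is one point along the domain axis; connecting the same quantile across rows gives one contour of the error distribution, which can then be compared against the error bound the sketch is expected to satisfy.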
  


