From: Apache Wiki
To: Apache Wiki
Date: Sat, 02 Mar 2019 02:49:26 -0000
Subject: [Incubator Wiki] Update of "DataSketchesPorposal" by chenliang613

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "DataSketchesPorposal" page has been changed by chenliang613:
https://wiki.apache.org/incubator/DataSketchesPorposal?action=diff&rev1=7&rev2=8

  We want to encourage more synergy with other data processing platforms. In addition to the fundamental capabilities of sketching mentioned above, the DataSketches library provides some additional capabilities specifically designed for large data platforms.
   * Binary compatibility across language, platform and history. Binary compatibility means that the stored image of a sketch can be fully interpreted and used by the same type of sketch in a different language (e.g., C++, Python) or on a different platform. Our guarantee is that sketches that were produced by the earliest versions of our code can still be read and interpreted by the latest versions of our code. This is critically important for systems that might store years' worth of sketches, because it is vastly more efficient than attempting to store years' worth of raw data. We have found that this property is vastly more important even than backward compatibility of the APIs.
     Unfortunately, APIs do have to change and evolve, and while we try hard to avoid this, it is sometimes required.
-  * * Accommodations for specific system architecture or language requirements. Through our work with the Druid team we learned the importance of being able to operate sketches off the Java heap. As a result, the sketches that we have currently integrated into Druid’s aggregation functions have this off-heap (or Direct) capability. By "operate" we mean that the sketch can be updated, queried, and merged without first having to be deserialized onto the Java heap. Our work with the PostgreSQL (C++) team has taught us the importance of enabling user specification of malloc() and free(), which can be customized to the environment.
+  * Accommodations for specific system architecture or language requirements. Through our work with the Druid team we learned the importance of being able to operate sketches off the Java heap. As a result, the sketches that we have currently integrated into Druid’s aggregation functions have this off-heap (or Direct) capability. By "operate" we mean that the sketch can be updated, queried, and merged without first having to be deserialized onto the Java heap. Our work with the PostgreSQL (C++) team has taught us the importance of enabling user specification of malloc() and free(), which can be customized to the environment.
+ 
+ We believe that having DataSketches as an Apache project will provide an immediate, worthwhile, and substantial contribution to the open source community, will give the project a better opportunity to contribute meaningfully to both the science and engineering of sketching algorithms, and will help it integrate with other Apache projects. In addition, this is a significant opportunity for Apache to be the "go-to" destination for users that want to leverage this exciting technology.

  == Apache DataSketches as a Top-Level Project ==
  Because successful development and implementation of high-performance sketches involves knowledge of advanced mathematics and statistics, there might be a tendency to associate the Apache DataSketches project with Apache Commons-Math or Apache Commons-Statistics. This, I believe, would be a mistake for a couple of reasons.
- 
- Language Support. The Apache Commons-Math, Apache Commons-Statistics, and Apache Commons-Lang libraries are exclusively Java libraries by definition. The DataSketches library supports multiple languages (so far: Java, C++, Python).
+  * Language Support. The Apache Commons-Math, Apache Commons-Statistics, and Apache Commons-Lang libraries are exclusively Java libraries by definition. The DataSketches library supports multiple languages (so far: Java, C++, Python).

- Visibility to data processing platform developers. Sketching is a relatively new field in the arsenal of tools available to system developers. Burying this project under Commons-Math or Commons-Statistics may make it harder to find. We want to encourage synergy with the various platforms so that they learn to leverage this technology and give us feedback on the capabilities designed into the sketches themselves.
+  * Visibility to data processing platform developers. Sketching is a relatively new field in the arsenal of tools available to system developers. Burying this project under Commons-Math or Commons-Statistics may make it harder to find. We want to encourage synergy with the various platforms so that they learn to leverage this technology and give us feedback on the capabilities designed into the sketches themselves.
- Sketches solve computationally difficult problems that are desirable queries in large data processing systems, such as unique counts, quantiles, CDFs, PMFs, histograms, heavy hitters (TopN), etc. They solve these problems in a mergeable and streaming way, which makes them suitable for real-time queries.
+  * Sketches solve computationally difficult problems that are desirable queries in large data processing systems, such as unique counts, quantiles, CDFs, PMFs, histograms, heavy hitters (TopN), etc. They solve these problems in a mergeable and streaming way, which makes them suitable for real-time queries.

  == Initial Goals ==
  We are breaking our initial goals into short-term (2-6 months) and intermediate to longer-term (6 months to 2 years):

  Our short-term goals include:

- Understanding and adapting to the Apache development process and structures.
+  * Understanding and adapting to the Apache development process and structures.
- Starting to refactor the codebase and moving the code of the various DataSketches repositories to an Apache Git repository.
+  * Starting to refactor the codebase and moving the code of the various DataSketches repositories to an Apache Git repository.
- Continuing development of new features, functions, and fixes.
+  * Continuing development of new features, functions, and fixes.
- Continuing to develop and expand specific sub-projects (e.g., C++ and Python).
+  * Continuing to develop and expand specific sub-projects (e.g., C++ and Python).

  The intermediate to longer-term goals include:

- Completing the design and implementation of the C++ sketches to complement what is already available in Java, and the Python wrappers of those C++ sketches.
+  * Completing the design and implementation of the C++ sketches to complement what is already available in Java, and the Python wrappers of those C++ sketches.
- Expanding the C++ build framework to include Windows and the popular Linux variants.
+  * Expanding the C++ build framework to include Windows and the popular Linux variants.
- Continued engagement with the scientific research community on the development of new algorithms for computationally difficult problems that heretofore have not had a sketching solution.
+  * Continued engagement with the scientific research community on the development of new algorithms for computationally difficult problems that heretofore have not had a sketching solution.

  == Current Status ==
  The DataSketches GitHub project has been quite successful. As of this writing (February 2019), the number of downloads measured by the Nexus Repository Manager at https://oss.sonatype.org has grown by nearly a factor of 10 over the past year to about 55 thousand per month. The DataSketches/sketches-core repository has about 560 stars and 141 forks, which is pretty good for a highly specialized library.

  == Development Practices ==
  === Source Control ===
- All of our developers have extensive experience with Git version control and follow accepted practices such as Pull Requests (PRs), code reviews, and commits to master.
+ All of our developers have extensive experience with Git version control and follow accepted practices such as Pull Requests (PRs), code reviews, and commits to master.
- 
- ---- /!\ '''Edit conflict - other version:''' ----
  === Testing ===
  Sketches, by their nature, are probabilistic programs and don’t necessarily behave deterministically. For some of the sketches we intentionally insert random noise into the code, as this gives us the mathematical properties that we need to guarantee accuracy. This can make the behavior of these algorithms quite unintuitive and poses significant challenges to the developer who wishes to test these algorithms for correctness. As a result, our testing strategy includes two major components: unit tests and characterization tests.

@@ -120, +119 @@
  === Unit Testing ===
  Our unit tests are primarily quick tests to make sure that we exercise all critical paths in the code and that key branches are executed correctly. It is important that they execute relatively quickly, as they are generally run on every code build. The sketches-core repository alone has about 22 thousand statements, over 1300 unit tests, and code coverage of about 98.2% as measured by Atlassian/Clover. Our goal is for every code repository used in production to have code coverage greater than 90%.
- 
- ---- /!\ '''Edit conflict - your version:''' ----
- 
- === Testing ===
- Sketches, by their nature, are probabilistic programs and don’t necessarily behave deterministically. For some of the sketches we intentionally insert random noise into the code, as this gives us the mathematical properties that we need to guarantee accuracy. This can make the behavior of these algorithms quite unintuitive and poses significant challenges to the developer who wishes to test these algorithms for correctness. As a result, our testing strategy includes two major components: unit tests and characterization tests.
- 
- === Unit Testing ===
- Our unit tests are primarily quick tests to make sure that we exercise all critical paths in the code and that key branches are executed correctly. It is important that they execute relatively quickly, as they are generally run on every code build. The sketches-core repository alone has about 22 thousand statements, over 1300 unit tests, and code coverage of about 98.2% as measured by Atlassian/Clover. Our goal is for every code repository used in production to have code coverage greater than 90%.
- 
- 
- ---- /!\ '''End of edit conflict''' ----
  === Characterization Testing ===
  In order to test the probabilistic methods that are used to interpret the stochastic behaviors of our sketches, we have a separate characterization repository that is dedicated to this. Measuring accuracy, for example, requires running thousands of trials at each of many different points along the domain axis. Each trial compares its estimated results against a known exact result, producing an error for that trial. These error measurements are then fed into our Quantiles sketch to capture the actual distribution of error at that point along the axis. We then select quantile contours across all the distributions at points along the axis. These contours can then be plotted to reveal the shape of the actual error distribution. These distributions are not at all Gaussian; in fact, they can be quite complex. Nonetheless, these distributions are then checked against the statistical guarantees inherent to the specific sketch algorithm and its parameters. There are many examples of these characterization error distributions on our website. The runtimes of these tests can be very long: they range from many minutes to hours, and some can run for days. Currently, we have separate characterization repositories for Java and C++ / Python.
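
The characterization procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the project's characterization code: `kmv_estimate` is a toy K-Minimum-Values distinct-count estimator standing in for a real sketch, the hashed input stream is simulated with uniform random draws, and exact sorting stands in for the Quantiles sketch that captures each error distribution. All function names, parameters, and trial counts here are hypothetical.

```python
import heapq
import random


def kmv_estimate(n, k, rng):
    """Toy K-Minimum-Values estimate of n distinct items (illustrative stand-in)."""
    if n <= k:
        return float(n)  # the sketch is not yet full, so the count is exact
    # Assumption: hashing n distinct items to [0, 1) behaves like n uniform draws.
    smallest = heapq.nsmallest(k, (rng.random() for _ in range(n)))
    # Classic KMV estimator: (k - 1) divided by the k-th smallest hash value.
    return (k - 1) / smallest[-1]


def characterize(points, trials, k, quantiles, seed=1):
    """Return {n: [relative error at each requested quantile]} for each point n."""
    rng = random.Random(seed)
    contours = {}
    for n in points:
        # One trial = one estimate compared against the known exact answer n.
        errors = sorted(kmv_estimate(n, k, rng) / n - 1.0 for _ in range(trials))
        # Exact order statistics stand in for querying a quantiles sketch.
        contours[n] = [errors[min(trials - 1, int(q * trials))] for q in quantiles]
    return contours


if __name__ == "__main__":
    result = characterize(points=[32, 1000, 5000], trials=200, k=64,
                          quantiles=[0.05, 0.5, 0.95])
    for n, row in result.items():
        print(n, [round(e, 3) for e in row])
```

Plotting each quantile column of the returned contours against `n` would give error-distribution curves of the kind the text describes. A real characterization run would use far more trials and many more points along the axis, which is consistent with the long runtimes mentioned above.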