From: Apache Wiki
To: Apache Wiki
Reply-To: general@incubator.apache.org
Date: Sat, 02 Mar 2019 02:23:04 -0000
Subject: [Incubator Wiki] Update of "DataSketchesPorposal" by chenliang613

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "DataSketchesPorposal" page has been changed by chenliang613:
https://wiki.apache.org/incubator/DataSketchesPorposal?action=diff&rev1=1&rev2=2

  == Development Practices ==
  === Source Control ===
  All of our developers have extensive experience with Git version control and follow accepted practices, for example in the use of Pull Requests (PRs), code reviews and commits to master.
+
  === Testing ===
  Sketches, by their nature, are probabilistic programs and don’t necessarily behave deterministically. For some of the sketches we intentionally insert random noise into the code, as this gives us the mathematical properties that we need to guarantee accuracy. This can make the behavior of these algorithms quite unintuitive and poses significant challenges to the developer who wishes to test these algorithms for correctness. As a result, our testing strategy includes two major components: unit tests and characterization tests.
+
  === Unit Testing ===
  Our unit tests are primarily quick tests to make sure that we exercise all critical paths in the code and that key branches are executed correctly.
  It is important that they execute relatively fast, as they are generally run on every code build. The sketches-core repository alone has about 22 thousand statements, over 1300 unit tests and code coverage of about 98.2% as measured by Atlassian/Clover. Our goal is for all of our code repositories that are used in production to have code coverage greater than 90%.
+
  === Characterization Testing ===
  To test the probabilistic methods that are used to interpret the stochastic behaviors of our sketches, we have a separate characterization repository that is dedicated to this. Measuring accuracy, for example, requires running thousands of trials at each of many different points along the domain axis. Each trial compares its estimated results against a known exact result, producing an error for that trial. These error measurements are then fed into our Quantiles sketch to capture the actual distribution of error at that point along the axis. We then select quantile contours across all the distributions at points along the axis. These contours can then be plotted to reveal the shape of the actual error distribution. These distributions are not at all Gaussian; in fact, they can be quite complex. Nonetheless, these distributions are then checked against the statistical guarantees inherent to the specific sketch algorithm and its parameters. There are many examples of these characterization error distributions on our website. The runtimes of these tests can be very long, ranging from many minutes to hours, and some can run for days. Currently, we have separate characterization repositories for Java and C++ / Python.

@@ -135, +138 @@
  The core developers and contributors for DataSketches are from diverse backgrounds, but primarily are scientists that love engineering and engineers that love science. A large part of the value we bring comes from this synthesis.
  These individuals have already contributed substantially to the code, algorithms, and/or mathematical proofs that form the basis of the library.

  This core group also forms the Initial Committers with write permissions to the repository. Those marked with (*) meet weekly to plan the research and engineering direction of the project.
+
  === Scientists That Love Engineering ===
- - Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel. Interests: distributed systems, scalable systems and platforms for big data processing, concurrent algorithms and data structures,
+ * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel. Interests: distributed systems, scalable systems and platforms for big data processing, concurrent algorithms and data structures,
- - Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs, Sunnyvale, California. Interests: algorithms, theoretical and applied mathematics, encoding and compression theory, theoretical and applied performance optimization.
+ * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs, Sunnyvale, California. Interests: algorithms, theoretical and applied mathematics, encoding and compression theory, theoretical and applied performance optimization.
- - Edo Liberty: (*) Director of Research, Head of Amazon AI Labs, Palo Alto, California. Manages the algorithms group at Amazon AI. We build scalable machine learning systems and algorithms which are used both internally and externally by customers of SageMaker, AWS's flagship machine learning platform.
+ * Edo Liberty: (*) Director of Research, Head of Amazon AI Labs, Palo Alto, California. Manages the algorithms group at Amazon AI. We build scalable machine learning systems and algorithms which are used both internally and externally by customers of SageMaker, AWS's flagship machine learning platform.
- - Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale.
-   Interests: Computational advertising, machine learning, speech recognition, data-driven analysis, large scale experimentation, big data, stream/complex event processing
+ * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale. Interests: Computational advertising, machine learning, speech recognition, data-driven analysis, large scale experimentation, big data, stream/complex event processing
- - Justin Thaler: (*) Assistant Professor, Department of Computer Science, Georgetown University, Washington D.C. Interests: algorithms and computational complexity, complexity theory, quantum algorithms, private data analysis, and learning theory, developing efficient streaming and sketching algorithms
+ * Justin Thaler: (*) Assistant Professor, Department of Computer Science, Georgetown University, Washington D.C. Interests: algorithms and computational complexity, complexity theory, quantum algorithms, private data analysis, and learning theory, developing efficient streaming and sketching algorithms

  === Engineers That Love Science ===
- - Roman Leventov: Senior Software Engineer, Metamarkets / Snap. Interests: design and implementation of data storing and data processing (distributed) systems, performance optimization, CPU performance, mechanical sympathy, JVM performance, API design, databases, (concurrent) data structures, memory management, garbage collection algorithms, language design and runtimes (their tradeoffs), distributed systems (cloud) efficiency, Linux, code quality, code transformation, pure functional programming models, Haskell.
+ * Roman Leventov: Senior Software Engineer, Metamarkets / Snap.
+   Interests: design and implementation of data storing and data processing (distributed) systems, performance optimization, CPU performance, mechanical sympathy, JVM performance, API design, databases, (concurrent) data structures, memory management, garbage collection algorithms, language design and runtimes (their tradeoffs), distributed systems (cloud) efficiency, Linux, code quality, code transformation, pure functional programming models, Haskell.
- - Lee Rhodes: (*) Distinguished Architect, lead developer and founder of the DataSketches project, Yahoo, Sunnyvale, California. Interests: streaming algorithms, mathematics, computer science, high quality and high performance code for the analysis of massive data, bridging the divide between theory and practice.
+ * Lee Rhodes: (*) Distinguished Architect, lead developer and founder of the DataSketches project, Yahoo, Sunnyvale, California. Interests: streaming algorithms, mathematics, computer science, high quality and high performance code for the analysis of massive data, bridging the divide between theory and practice.
- - Alexander Saydakov: (*) Senior Software Engineer, Yahoo, Sunnyvale, California. Interests: applied mathematics, computer science, big data, distributed systems.
+ * Alexander Saydakov: (*) Senior Software Engineer, Yahoo, Sunnyvale, California. Interests: applied mathematics, computer science, big data, distributed systems.

  == Introduction to Additional Interested Contributors ==
  These folks have been involved and have contributed intermittently, but are strong supporters of this project.
- - Frank Grimes: GitHub ID: frankgrimes97
+ * Frank Grimes: GitHub ID: frankgrimes97
- - Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D. Computer Science, Univ of Utah. Interests: Machine Learning, Data Mining, matrix approximation, streaming algorithms, randomized linear algebra.
+ * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D.
+   Computer Science, Univ of Utah. Interests: Machine Learning, Data Mining, matrix approximation, streaming algorithms, randomized linear algebra.
- - Christopher Musco: [christopher.musco at gmail dot com] Ph.D. Computer Science, Research Instructor, Princeton University. Interests: algorithmic foundations of data science and machine learning, efficient methods for processing and understanding large datasets, often working at the intersection of theoretical computer science, numerical linear algebra, and optimization.
+ * Christopher Musco: [christopher.musco at gmail dot com] Ph.D. Computer Science, Research Instructor, Princeton University. Interests: algorithmic foundations of data science and machine learning, efficient methods for processing and understanding large datasets, often working at the intersection of theoretical computer science, numerical linear algebra, and optimization.
- - Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D. Computer Science, Professor, Warwick University, Warwick, England. Interests: all aspects of the "data lifecycle", from data collection and cleaning, through mining and analytics. (Professor Cormode is one of the world’s leading scientists in sketching algorithms)
+ * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D. Computer Science, Professor, Warwick University, Warwick, England. Interests: all aspects of the "data lifecycle", from data collection and cleaning, through mining and analytics. (Professor Cormode is one of the world’s leading scientists in sketching algorithms)

  == Alignment ==
  The DataSketches library already provides integrations and example code for Apache Hive, Apache Pig, and Apache Spark, and is deeply integrated into Apache Druid.
+
  == Known Risks ==
  The following subsections address specific risks that have been identified by the ASF.
+
  === Risk: Orphaned Products ===
  The DataSketches library is presently used by a number of organizations, from small startups to Fortune 100 companies, to construct production pipelines that must process and analyze massive data. Yahoo has a long-term commitment to continue to advance the DataSketches library; moreover, DataSketches is seeing increasing interest, development, and adoption from many diverse organizations from around the world. Due to its growing adoption, we feel it is quite unlikely that this project would become orphaned.
+
  === Risk: Inexperience with Open Source ===
  Yahoo believes strongly in open source and the exchange of information to advance new ideas and work. Examples of this commitment are active open source projects such as those mentioned above. With DataSketches, we have been increasingly open and forward-looking; we have published a number of papers about breakthrough developments in the science of streaming algorithms (mentioned above) that also reference the DataSketches library. Our submission to the Apache Software Foundation is a logical extension of our commitment to open source software.

  Key committers at Yahoo with strong open source backgrounds include Aaron Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky, Andrews Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan Call, Daryn Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar Hillel, Ethan Li, Fei Deng, Francis Christopher Liu, Francisco Perez-Sorrosal, Gil Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher, James Penick, Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon Eagles, Kihwal Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael Trelinski, Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L.
  Natkovich, Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby Loo, Ryan Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit Chan, Sri Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many more.
  All of our core developers are committed to learning about the Apache process and to giving back to the community.
+
  === Risk: Homogeneous Developers ===
  The majority of committers in this proposal belong to Yahoo because DataSketches emerged from an internal Yahoo project. This proposal also includes developers and contributors from other companies who are actively involved with other Apache projects, such as Druid. We expect our entry into incubation will allow us to expand the number of individuals and organizations participating in DataSketches development.
+
  === Risk: Reliance on Salaried Developers ===
  Because the DataSketches library originated within Yahoo, it has been developed primarily by salaried Yahoo developers, and we expect that to continue to be the case in the near term. However, since we placed this library into open source, we have had a number of significant contributions from engineers and scientists from outside of Yahoo. We expect our reliance on Yahoo salaried developers will decrease over time. Nonetheless, Yahoo is committed to continuing its strong support of this important project.
+
  === Risk: Lack of Relationship to other Apache Products ===
  DataSketches already directly interoperates with or utilizes several existing Apache projects.

- -Build
+ * Build
- - Apache Maven
+ ** Apache Maven
- - Integrations and adaptors for the following projects naturally have them as dependencies
+ * Integrations and adaptors for the following projects naturally have them as dependencies
- - Apache Hive
+ ** Apache Hive
- - Apache Pig
+ ** Apache Pig
- - Apache Druid
+ ** Apache Druid
- - Apache Spark
+ ** Apache Spark
- - Additional dependencies for the above integrations and adaptors include
+ * Additional dependencies for the above integrations and adaptors include
- - Apache Hadoop
+ ** Apache Hadoop
- - Apache Commons (Math)
+ ** Apache Commons (Math)

  There is no other Apache project that we are aware of that duplicates the functionality of the DataSketches library.
  === Risk: An Excessive Fascination with the Apache Brand ===

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org