incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "DataSketchesPorposal" by chenliang613
Date Sat, 02 Mar 2019 02:24:05 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "DataSketchesPorposal" page has been changed by chenliang613:
https://wiki.apache.org/incubator/DataSketchesPorposal?action=diff&rev1=2&rev2=3

  === Source Control ===
  All of our developers have extensive experience with Git version control and follow accepted
practices for use of Pull Requests (PRs), code reviews and commits to master, for example.
 
  
+ ---- /!\ '''Edit conflict - other version:''' ----
+ 
  === Testing === 
  Sketches, by their nature, are probabilistic programs and don’t necessarily behave deterministically.
 For some of the sketches we intentionally insert random noise into the code as this gives
us the mathematical properties that we need to guarantee accuracy.  This can make the behavior
of these algorithms quite unintuitive and provides significant challenges to the developer
who wishes to test these algorithms for correctness. As a result, our testing strategy includes
two major components: unit tests, and characterization tests.  
  
  === Unit Testing ===
  Our unit tests are primarily quick tests to make sure that we exercise all critical paths
in the code and that key branches are executed correctly. It is important that they execute
relatively fast as they are generally run on every code build. The sketches-core repository
alone has about 22 thousand statements, over 1300 unit tests and code coverage of about 98.2%
as measured by Atlassian/Clover.  It is our goal for all of our code repositories that are
used in production that they have code coverage greater than 90%.
  
+ 
+ ---- /!\ '''Edit conflict - your version:''' ----
+ 
+ === Testing === 
+ Sketches, by their nature, are probabilistic programs and don’t necessarily behave deterministically.
 For some of the sketches we intentionally insert random noise into the code as this gives
us the mathematical properties that we need to guarantee accuracy.  This can make the behavior
of these algorithms quite unintuitive and provides significant challenges to the developer
who wishes to test these algorithms for correctness. As a result, our testing strategy includes
two major components: unit tests, and characterization tests.  
+ 
+ === Unit Testing ===
+ Our unit tests are primarily quick tests to make sure that we exercise all critical paths
in the code and that key branches are executed correctly. It is important that they execute
relatively fast as they are generally run on every code build. The sketches-core repository
alone has about 22 thousand statements, over 1300 unit tests and code coverage of about 98.2%
as measured by Atlassian/Clover.  It is our goal for all of our code repositories that are
used in production that they have code coverage greater than 90%.
+ 
+ 
+ ---- /!\ '''End of edit conflict''' ----
  === Characterization Testing ===
  In order to test the probabilistic methods that are used to interpret the stochastic behaviors
of our sketches we have a separate characterization repository that is dedicated to this.
 Measuring accuracy, for example, requires running thousands of trials at each of many different
points along the domain axis. Each trial compares its estimated results against a known exact
result producing an error for that trial.  These error measurements are then fed into our
Quantiles sketch to capture the actual distribution of error at that point along the axis.
We then select quantile contours across all the distributions at points along the axis.  These
contours can then be plotted to reveal the shape of the actual error distribution. These distributions
are not at all Gaussian; in fact, they can be quite complex.  Nonetheless, these distributions
are then checked against the statistical guarantees inherent to the specific sketch algorithm
and its parameters. There are many examples of these characterization error distributions
on our website. The runtimes of these tests can be very long and can range from many minutes
to hours, and some can run for days.  Currently, we have separate characterization repositories
for Java and C++ / Python.   
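
  The characterization procedure described above can be illustrated with a minimal Python sketch. It is not code from the DataSketches characterization repositories: the function names are hypothetical, the estimator is a simple K-Minimum-Values distinct-count estimate chosen only for illustration, and exact quantiles over a sorted list stand in for the Quantiles sketch that the real harness uses to capture the error distribution.

```python
import heapq
import random


def kmv_estimate(n, k, rng):
    """Hypothetical stand-in estimator: K-Minimum-Values distinct count.

    Each of the n distinct items is modelled by one uniform hash in [0, 1).
    """
    if n <= k:
        return float(n)  # fewer items than k: the sketch would be exact
    kth_smallest = heapq.nsmallest(k, (rng.random() for _ in range(n)))[-1]
    return (k - 1) / kth_smallest


def quantile(sorted_values, q):
    """Exact quantile of a sorted list (stands in for a Quantiles sketch)."""
    idx = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[idx]


def characterize(points, trials, k, quantiles, seed=1):
    """Run many trials at each point along the axis and return the
    quantile contours of the relative error at that point."""
    rng = random.Random(seed)
    contours = {}
    for n in points:
        # Relative error of each trial against the known exact result n.
        errors = sorted(kmv_estimate(n, k, rng) / n - 1.0 for _ in range(trials))
        contours[n] = [quantile(errors, q) for q in quantiles]
    return contours


contours = characterize(points=[100, 1000, 10000], trials=200, k=64,
                        quantiles=[0.05, 0.5, 0.95])
for n, row in contours.items():
    print(n, [round(e, 3) for e in row])
```

  Each printed row gives the 5%, 50% and 95% error contours at one point along the axis. Plotting these rows against n would give the kind of error-distribution plot described above, which can then be compared with the error bounds of the sketch being characterized.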
  
@@ -138, +151 @@

  The core developers and contributors for DataSketches are from diverse backgrounds, but
primarily are scientists that love engineering and engineers that love science. A large part
of the value we bring comes from this synthesis.  These individuals have already contributed
substantially to the code, algorithms, and/or mathematical proofs that form the basis of the
library.  
  
  This core group also forms the Initial Committers with write permissions to the repository.
Those marked with (*) meet weekly to plan the research and engineering direction of the project.
+ 
+ ---- /!\ '''Edit conflict - other version:''' ----
  
  === Scientists That Love Engineering ===
  * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel. Interests: distributed systems,
scalable systems and platforms for big data processing, concurrent algorithms and data structures,

@@ -151, +166 @@

  * Lee Rhodes: (*) Distinguished Architect, lead developer and founder of the DataSketches
project, Yahoo, Sunnyvale, California.  Interests: streaming algorithms, mathematics, computer
science, high quality and high performance code for the analysis of massive data, bridging
the divide between theory and practice.
  * Alexander Saydakov: (*) Senior Software Engineer, Yahoo, Sunnyvale, California. Interests:
applied mathematics, computer science, big data, distributed systems.
  
+ ---- /!\ '''Edit conflict - your version:''' ----
+ 
+ === Scientists That Love Engineering ===
+ * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel. Interests: distributed systems,
scalable systems and platforms for big data processing, concurrent algorithms and data structures,

+ * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs, Sunnyvale, California. Interests:
algorithms, theoretical and applied mathematics, encoding and compression theory, theoretical
and applied performance optimization.
+ * Edo Liberty: (*) Director of Research, Head of Amazon AI Labs, Palo Alto, California.
Manages the algorithms group at Amazon AI. We build scalable machine learning systems and
algorithms which are used both internally and externally by customers of SageMaker, AWS's
flagship machine learning platform. 
+ * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale. Interests: Computational advertising,
machine learning, speech recognition, data-driven analysis, large scale experimentation, big
data, stream/complex event processing.
+ * Justin Thaler: (*) Assistant Professor, Department of Computer Science, Georgetown University,
Washington D.C. Interests: algorithms and computational complexity, complexity theory, quantum
algorithms, private data analysis, and learning theory, developing efficient streaming and
sketching algorithms.
+ 
+ === Engineers That Love Science ===
+ * Roman Leventov: Senior Software Engineer,  Metamarkets / Snap. Interests: design and implementation
of data storing and data processing (distributed) systems, performance optimization, CPU performance,
mechanical sympathy, JVM performance, API design, databases, (concurrent) data structures,
memory management, garbage collection algorithms, language design and runtimes (their tradeoffs),
distributed systems (cloud) efficiency, Linux, code quality, code transformation, pure functional
programming models, Haskell.
+ * Lee Rhodes: (*) Distinguished Architect, lead developer and founder of the DataSketches
project, Yahoo, Sunnyvale, California.  Interests: streaming algorithms, mathematics, computer
science, high quality and high performance code for the analysis of massive data, bridging
the divide between theory and practice.
+ * Alexander Saydakov: (*) Senior Software Engineer, Yahoo, Sunnyvale, California. Interests:
applied mathematics, computer science, big data, distributed systems.
+ 
+ ---- /!\ '''End of edit conflict''' ----
+ 
  == Introduction to Additional Interested Contributors ==
  These folks have been intermittently involved and contributed, but are strong supporters
of this project.
  
+ 
+ ---- /!\ '''Edit conflict - other version:''' ----
  * Frank Grimes: GitHub ID: frankgrimes97
  * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D. Computer Science, Univ of Utah.
Interests: Machine Learning, Data Mining, matrix approximation, streaming algorithms, randomized
linear algebra.
  * Christopher Musco: [christopher.musco at gmail dot com] Ph.D. Computer Science, Research
Instructor, Princeton University. Interests: algorithmic foundations of data science and machine
learning, efficient methods for processing and understanding large datasets, often working
at the intersection of theoretical computer science, numerical linear algebra, and optimization.
  * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D. Computer Science, Professor, Warwick
University, Warwick, England. Interests: all aspects of the "data lifecycle", from data collection
and cleaning, through mining and analytics. (Professor Cormode is one of the world’s leading
scientists in sketching algorithms.)
  
+ ---- /!\ '''Edit conflict - your version:''' ----
+ * Frank Grimes: GitHub ID: frankgrimes97
+ * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D. Computer Science, Univ of Utah.
Interests: Machine Learning, Data Mining, matrix approximation, streaming algorithms, randomized
linear algebra.
+ * Christopher Musco: [christopher.musco at gmail dot com] Ph.D. Computer Science, Research
Instructor, Princeton University. Interests: algorithmic foundations of data science and machine
learning, efficient methods for processing and understanding large datasets, often working
at the intersection of theoretical computer science, numerical linear algebra, and optimization.
+ * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D. Computer Science, Professor, Warwick
University, Warwick, England. Interests: all aspects of the "data lifecycle", from data collection
and cleaning, through mining and analytics. (Professor Cormode is one of the world’s leading
scientists in sketching algorithms.)
+ 
+ ---- /!\ '''End of edit conflict''' ----
+ 
  == Alignment ==
  The DataSketches library already provides integrations and example code for Apache Hive,
Apache Pig, and Apache Spark, and is deeply integrated into Apache Druid. 
  
+ ---- /!\ '''Edit conflict - other version:''' ----
+ 
  == Known Risks ==
  The following subsections are specific risks that have been identified by the ASF that need
to be addressed.
  
  === Risk: Orphaned Products ===
  The DataSketches library is presently used by a number of organizations, from small startups
to Fortune 100 companies, to construct production pipelines that must process and analyze
massive data. Yahoo has a long-term commitment to continue to advance the DataSketches library;
moreover, DataSketches is seeing increasing interest, development, and adoption from many
diverse organizations from around the world. Due to its growing adoption, we feel it is quite
unlikely that this project would become orphaned.
  
+ 
+ ---- /!\ '''Edit conflict - your version:''' ----
+ 
+ == Known Risks ==
+ The following subsections are specific risks that have been identified by the ASF that need
to be addressed.
+ 
+ === Risk: Orphaned Products ===
+ The DataSketches library is presently used by a number of organizations, from small startups
to Fortune 100 companies, to construct production pipelines that must process and analyze
massive data. Yahoo has a long-term commitment to continue to advance the DataSketches library;
moreover, DataSketches is seeing increasing interest, development, and adoption from many
diverse organizations from around the world. Due to its growing adoption, we feel it is quite
unlikely that this project would become orphaned.
+ 
+ 
+ ---- /!\ '''End of edit conflict''' ----
  === Risk: Inexperience with Open Source ===
  Yahoo believes strongly in open source and the exchange of information to advance new ideas
and work. Examples of this commitment are active open source projects such as those mentioned
above. With DataSketches, we have been increasingly open and forward-looking; we have published
a number of papers about breakthrough developments in the science of streaming algorithms
(mentioned above) that also reference the DataSketches library.  Our submission to the Apache
Software Foundation is a logical extension of our commitment to open source software. 
  
@@ -175, +229 @@

  
  All of our core developers are committed to learning about the Apache process and to giving back
to the community. 
  
+ ---- /!\ '''Edit conflict - other version:''' ----
+ 
  === Risk: Homogeneous Developers ===
  The majority of committers in this proposal belong to Yahoo due to the fact that DataSketches
has emerged from an internal Yahoo project. This proposal also includes developers and contributors
from other companies who are actively involved with other Apache projects, such as Druid.
 We expect our entry into incubation will allow us to expand the number of individuals and
organizations participating in DataSketches development.
  
  === Risk: Reliance on Salaried Developers ===
  Because the DataSketches library originated within Yahoo, it has been developed primarily
by salaried Yahoo developers and we expect that to continue to be the case near term. However,
since we placed this library into open source, we have had a number of significant contributions
from engineers and scientists from outside of Yahoo. We expect our reliance on Yahoo salaried
developers will decrease over time. Nonetheless, Yahoo is committed to continuing its strong
support of this important project.
  
+ 
+ ---- /!\ '''Edit conflict - your version:''' ----
+ 
+ === Risk: Homogeneous Developers ===
+ The majority of committers in this proposal belong to Yahoo due to the fact that DataSketches
has emerged from an internal Yahoo project. This proposal also includes developers and contributors
from other companies who are actively involved with other Apache projects, such as Druid.
 We expect our entry into incubation will allow us to expand the number of individuals and
organizations participating in DataSketches development.
+ 
+ === Risk: Reliance on Salaried Developers ===
+ Because the DataSketches library originated within Yahoo, it has been developed primarily
by salaried Yahoo developers and we expect that to continue to be the case near term. However,
since we placed this library into open source, we have had a number of significant contributions
from engineers and scientists from outside of Yahoo. We expect our reliance on Yahoo salaried
developers will decrease over time. Nonetheless, Yahoo is committed to continuing its strong
support of this important project.
+ 
+ 
+ ---- /!\ '''End of edit conflict''' ----
  === Risk: Lack of Relationship to other Apache Products ===
  DataSketches already directly interoperates with or utilizes several existing Apache projects.

  
+ 
+ ---- /!\ '''Edit conflict - other version:''' ----
  * Build
   ** Apache Maven
  * Integrations and adaptors for the following projects naturally have them as dependencies
@@ -194, +263 @@

  * Additional dependencies for the above integrations and adaptors include
   ** Apache Hadoop
   ** Apache Commons (Math)
+ 
+ ---- /!\ '''Edit conflict - your version:''' ----
+ * Build
+  ** Apache Maven
+ * Integrations and adaptors for the following projects naturally have them as dependencies
+  ** Apache Hive
+  ** Apache Pig
+  ** Apache Druid
+  ** Apache Spark
+ * Additional dependencies for the above integrations and adaptors include
+  ** Apache Hadoop
+  ** Apache Commons (Math)
+ 
+ ---- /!\ '''End of edit conflict''' ----
  
  There is no other Apache project that we are aware of that duplicates the functionality
of the DataSketches library.
  === Risk: An Excessive Fascination with the Apache Brand ===

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org

