incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "DataSketchesPorposal" by chenliang613
Date Sat, 02 Mar 2019 03:45:15 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "DataSketchesPorposal" page has been changed by chenliang613:
https://wiki.apache.org/incubator/DataSketchesPorposal?action=diff&rev1=10&rev2=11

  
  There is no other Apache project that we are aware of that duplicates the functionality
of the DataSketches library.
  
+ === Risk: An Excessive Fascination with the Apache Brand ===
+ With this proposal we are not seeking attention or publicity. Rather, we firmly believe
in the DataSketches library and concept and the ability to make the DataSketches library a
powerful, yet simple-to-use toolkit for data processing. While the DataSketches library has
been open source, we believe putting code on GitHub can only go so far. We see the Apache
community, processes, and mission as critical for ensuring the DataSketches library is truly
community-driven, positively impactful, and innovative open source software. While Yahoo has
taken a number of steps to advance its various open source projects, we believe the DataSketches
library project is a great fit for the Apache Software Foundation due to its focus on data
processing and its relationships to existing ASF projects.
+ 
+ === Risk: Cryptography ===
+ DataSketches does not contain any cryptographic code and is not a cryptographic product.
+ 
  == Documentation ==
  The following documentation is relevant to this proposal. Relevant portions of the documentation
will be contributed to the Apache DataSketches project.
- - DataSketches website: https://datasketches.github.io.  
+  * DataSketches website: https://datasketches.github.io.  
- - DataSketches website repository: https://github.com/DataSketches/DataSketches.github.io

+  * DataSketches website repository: https://github.com/DataSketches/DataSketches.github.io

  
  We will need an Apache website for this documentation, similar to
- - https://datasketches.apache.org
+  * https://datasketches.apache.org
  
  == Initial Source ==
  The initial source for DataSketches, which we will submit to the Apache Software Foundation, will include
a number of repositories that are currently hosted under the github.com/datasketches organization:

  
  All github.com/datasketches repositories including:
  
- - Java Production Code
+  * Java Production Code
-  - sketches-core: This repository has the core sketching classes, which are leveraged by
some of the other repositories. This repository has no external dependencies outside of the
DataSketches/memory repository, Java and TestNG for unit tests. This code is versioned and
the latest release can be obtained from Maven Central.
+   - sketches-core: This repository has the core sketching classes, which are leveraged by
some of the other repositories. This repository has no external dependencies outside of the
DataSketches/memory repository, Java and TestNG for unit tests. This code is versioned and
the latest release can be obtained from Maven Central.
-  - memory: Low level, high-performance memory data-structure management primarily for off-heap.
This code is versioned and the latest release can be obtained from Maven Central.
+   - memory: Low level, high-performance memory data-structure management primarily for off-heap.
This code is versioned and the latest release can be obtained from Maven Central.
-  - sketches-android: This is a new repository dedicated to sketches designed to be run in
a mobile client, such as a cell phone or IoT device. It should be considered experimental.
 It is not currently versioned or published to Maven Central.
+   - sketches-android: This is a new repository dedicated to sketches designed to be run
in a mobile client, such as a cell phone or IoT device. It should be considered experimental.
 It is not currently versioned or published to Maven Central.
-  - sketches-hive: This repository contains Hive UDFs and UDAFs for use within Hadoop grid
environments. This code has dependencies on sketches-core as well as Hadoop and Hive. Users
of this code are advised to use Maven to bring in all the required dependencies. This code
is versioned and the latest release can be obtained from Maven Central.
+   - sketches-hive: This repository contains Hive UDFs and UDAFs for use within Hadoop grid
environments. This code has dependencies on sketches-core as well as Hadoop and Hive. Users
of this code are advised to use Maven to bring in all the required dependencies. This code
is versioned and the latest release can be obtained from Maven Central.
-  - sketches-pig: This repository contains Pig User Defined Functions (UDF) for use within
Hadoop grid environments. This code has dependencies on sketches-core as well as Hadoop and Pig.
Users of this code are advised to use Maven to bring in all the required dependencies. This code
is versioned and the latest release can be obtained from Maven Central.
+   - sketches-pig: This repository contains Pig User Defined Functions (UDF) for use within
Hadoop grid environments. This code has dependencies on sketches-core as well as Hadoop and Pig.
Users of this code are advised to use Maven to bring in all the required dependencies. This code
is versioned and the latest release can be obtained from Maven Central.
-  - sketches-vector: This is a new repository dedicated to sketches for vector and matrix
operations. It is still somewhat experimental. It is versioned and published to Maven Central.
+   - sketches-vector: This is a new repository dedicated to sketches for vector and matrix
operations. It is still somewhat experimental. It is versioned and published to Maven Central.
  
- - Java Non-Production Code
+  * Java Non-Production Code
-  - characterization: This relatively new repository is for code that we use to characterize
the accuracy and speed performance of the sketches in the library and is constantly being
updated. Examples of the job command files used for various tests can be found in the src/main/resources
directory. Some of these tests can run for hours depending on their configuration. This code
is not versioned and not published to Maven Central.
+   - characterization: This relatively new repository is for code that we use to characterize
the accuracy and speed performance of the sketches in the library and is constantly being
updated. Examples of the job command files used for various tests can be found in the src/main/resources
directory. Some of these tests can run for hours depending on their configuration. This code
is not versioned and not published to Maven Central.
-  - experimental: This repository is an experimental staging area for code that may eventually
end up in another repository. This code is not versioned and not published to Maven Central.
+   - experimental: This repository is an experimental staging area for code that may eventually
end up in another repository. This code is not versioned and not published to Maven Central.
-  - sketches-misc: Demos and other code not related to production deployment. We have no
plans to publish this to Maven Central in the future.
+   - sketches-misc: Demos and other code not related to production deployment. We have no
plans to publish this to Maven Central in the future.
  
- - C++ and Python Production Code
+  * C++ and Python Production Code
-  - sketches-core-cpp: This is the C++/Python companion to the Java sketches-core. These
implementations are binary compatible with their counterparts in Java. In other words, a sketch
created and stored in C++ can be opened and read in Java and vice versa. This site also has
our Python adaptors that basically wrap the C++ implementations, making the high performance
C++ implementations available from Python.  This code will be versioned.
+   - sketches-core-cpp: This is the C++/Python companion to the Java sketches-core. These
implementations are binary compatible with their counterparts in Java. In other words, a sketch
created and stored in C++ can be opened and read in Java and vice versa. This site also has
our Python adaptors that basically wrap the C++ implementations, making the high performance
C++ implementations available from Python.  This code will be versioned.
-  - sketches-postgres: This site provides the postgres-specific adaptors that wrap the C++
implementations making them available to the Postgres database users.  This code will be versioned.
+   - sketches-postgres: This site provides the postgres-specific adaptors that wrap the C++
implementations making them available to the Postgres database users.  This code will be versioned.
  
- - C++ and Python Non-Production Code
+  * C++ and Python Non-Production Code
-  - characterization-cpp: This is the C++/Python companion to the Java characterization repository.
This code will not be versioned.
+   - characterization-cpp: This is the C++/Python companion to the Java characterization
repository. This code will not be versioned.
-  - experimental-cpp: This repository is an experimental staging area for C++ code that will
eventually end up in another repository.  This code will not be versioned.
+   - experimental-cpp: This repository is an experimental staging area for C++ code that
will eventually end up in another repository.  This code will not be versioned.
  
- - Command-Line Tools - Non Production Code
+  * Command-Line Tools - Non Production Code
  (These may eventually be replaced by Python scripts.)
-  - sketches-cmd
+   - sketches-cmd
-  - homebrew-sketches
+   - homebrew-sketches
-  - homebrew-sketches-cmd
+   - homebrew-sketches-cmd
  
  These projects have always been Apache 2.0 licensed. We intend to bundle all of these repositories
since they are all complementary and should be maintained in one project. Prior to our submission,
we will combine all of these projects into a new git repository. 
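
The binary compatibility described above for sketches-core-cpp (a sketch stored by one language can be read by another) depends on a byte layout with fixed field order, width, and endianness. The following is a minimal sketch of that idea only, not the DataSketches format or API: the class name `ToyKmvSketch`, the `KMV1` magic bytes, and the header layout are all hypothetical, and the toy K-Minimum-Values estimator stands in for the real sketch families.

```python
import hashlib
import struct

# Hypothetical image layout (NOT the real DataSketches format):
#   4-byte magic, uint32 k, uint32 count, then `count` uint64 hashes,
#   all little-endian, so a reader in any language can decode it.
MAGIC = b"KMV1"
HEADER_BYTES = 12
MAX_HASH = 2 ** 64


class ToyKmvSketch:
    """Toy distinct-count sketch that keeps the k smallest 64-bit hashes."""

    def __init__(self, k=64):
        self.k = k
        self.hashes = set()

    def update(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "little")
        self.hashes.add(h)
        if len(self.hashes) > self.k:
            # Retain only the k smallest hash values.
            self.hashes.remove(max(self.hashes))

    def estimate(self):
        if len(self.hashes) < self.k:
            return float(len(self.hashes))  # fewer than k distinct items: exact
        # Standard KMV estimator: (k - 1) / (k-th smallest hash as a fraction).
        return (self.k - 1) * MAX_HASH / max(self.hashes)

    def to_bytes(self):
        ordered = sorted(self.hashes)
        header = MAGIC + struct.pack("<II", self.k, len(ordered))
        return header + struct.pack("<%dQ" % len(ordered), *ordered)

    @classmethod
    def from_bytes(cls, data):
        if data[:4] != MAGIC:
            raise ValueError("not a toy KMV image")
        k, count = struct.unpack_from("<II", data, 4)
        sketch = cls(k)
        sketch.hashes = set(struct.unpack_from("<%dQ" % count, data, HEADER_BYTES))
        return sketch


if __name__ == "__main__":
    original = ToyKmvSketch(k=64)
    for i in range(10000):
        original.update(i)
    restored = ToyKmvSketch.from_bytes(original.to_bytes())
    # The restored sketch reports the same estimate as the original.
    print(original.estimate() == restored.estimate())
```

Because every field has an explicit width and byte order, a reader written in Java or C++ could decode the same image without depending on the writer's platform.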
  
@@ -249, +255 @@

  All external run-time dependencies are licensed under an Apache 2.0 or Apache-compatible
license. As we grow the DataSketches community we will configure our build process to require
and validate that all contributions and dependencies are licensed under the Apache 2.0 license
or are under an Apache-compatible license.
  
  Viewing all the repositories of the current github.com/datasketches organization, there
are a number of types of external dependencies:
- - Core Java Runtime Dependencies:  these are dependencies that would have to be supplied
to run the code in the sketches-core-X.Y.Z.jar (from Maven Central), which contains all of
the sketch families. Currently there are only two:
+  * Core Java Runtime Dependencies:  these are dependencies that would have to be supplied
to run the code in the sketches-core-X.Y.Z.jar (from Maven Central), which contains all of
the sketch families. Currently there are only two:
-  - com.yahoo.datasketches/memory. This package is not technically an external dependency
since it must be included as part of the Apache DataSketches repository. 
+   - com.yahoo.datasketches/memory. This package is not technically an external dependency
since it must be included as part of the Apache DataSketches repository. 
-  - org.slf4j/slf4j-api.  A generic interface-only API that enables different logging tools
to be plugged in at runtime.  Different systems prefer different logging tools and this API
allows interfacing with many popular logging tools.  If a user does not supply a logging tool,
this API behaves as a no-op.
+   - org.slf4j/slf4j-api.  A generic interface-only API that enables different logging tools
to be plugged in at runtime.  Different systems prefer different logging tools and this API
allows interfacing with many popular logging tools.  If a user does not supply a logging tool,
this API behaves as a no-op.
- - Java Test Dependencies: 
+  * Java Test Dependencies: 
  org.testng/testng. Used for unit testing in all the Java repositories.
- - Java Build Dependencies: All of the Java repositories use Apache Maven, so there is a
long list of Maven dependencies plus a few others that plug into Maven such as:
+  * Java Build Dependencies: All of the Java repositories use Apache Maven, so there is a
long list of Maven dependencies plus a few others that plug into Maven such as:
- org.codehaus.mojo 
+   - org.codehaus.mojo 
- org.jacoco
+   - org.jacoco
- Java Characterization Dependencies: The characterization code is not production run-time
code; it is provided for users to inspect and run if they wish. Because this repository contains characterization
tests for algorithms from external sources, it by definition has dependencies on those external
sources.
+  * Java Characterization Dependencies: The characterization code is not production run-time
code; it is provided for users to inspect and run if they wish. Because this repository contains characterization
tests for algorithms from external sources, it by definition has dependencies on those external
sources.
- Java System Integration Adaptor Dependencies: The code in the sketches-pig and sketches-hive
repositories by definition relies on Apache Pig, Apache Hive and Apache Hadoop code.  
+  * Java System Integration Adaptor Dependencies: The code in the sketches-pig and sketches-hive
repositories by definition relies on Apache Pig, Apache Hive and Apache Hadoop code.  
- C++ and Python Repositories: This is still evolving, but we have tried to limit the dependencies
to the C and C++ Standard and Boost libraries.
+  * C++ and Python Repositories: This is still evolving, but we have tried to limit the dependencies
to the C and C++ Standard and Boost libraries.
- C++ System Integration Adaptor Dependencies: So far we only have an adaptor for PostgreSQL
which has dependencies on PostgreSQL.
+  * C++ System Integration Adaptor Dependencies: So far we only have an adaptor for PostgreSQL
which has dependencies on PostgreSQL.
- Ruby / Homebrew Command-Line Tool: This has dependencies on Ruby and Homebrew code (for
Mac systems).
+  * Ruby / Homebrew Command-Line Tool: This has dependencies on Ruby and Homebrew code (for
Mac systems).
- Android-based Sketches: So far, this only has dependencies on Java and no other external
dependencies.
+  * Android-based Sketches: So far, this only has dependencies on Java and no other external
dependencies.
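
The slf4j-api item above describes an interface-only logging API that falls back to a no-op when the user supplies no logging tool. The following is a minimal sketch of that pattern, assuming nothing about the real SLF4J API: the names `NoOpBackend`, `ListBackend`, `bind`, and `info` are hypothetical and exist only for illustration.

```python
class NoOpBackend:
    """Fallback used when the application plugs in nothing; discards messages."""

    def log(self, level, message):
        pass


class ListBackend:
    """Example backend an application might supply; records messages in memory."""

    def __init__(self):
        self.records = []

    def log(self, level, message):
        self.records.append((level, message))


# Library code only ever talks to this facade, never to a concrete logger.
_backend = NoOpBackend()


def bind(backend):
    """Plug in a concrete logging backend at runtime."""
    global _backend
    _backend = backend


def info(message):
    """Log at INFO level through whichever backend is currently bound."""
    _backend.log("INFO", message)


if __name__ == "__main__":
    info("dropped silently")      # no backend bound yet: no-op
    backend = ListBackend()
    bind(backend)
    info("sketch merged")         # now recorded by the bound backend
    print(backend.records)
```

The library depends only on the facade, so each deployment can choose its own logging tool, or none, without any change to library code.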
+ 
- Required Resources
+ == Required Resources ==
- Mailing Lists
+ === Mailing Lists ===
  We currently use a mix of mailing lists. We will migrate our existing mailing lists to the
following:
- 
- dev@datasketches.incubator.apache.org
+  * dev@datasketches.incubator.apache.org
- user@datasketches.incubator.apache.org
+  * user@datasketches.incubator.apache.org
- private@datasketches.incubator.apache.org
+  * private@datasketches.incubator.apache.org
- commits@datasketches.incubator.apache.org
+  * commits@datasketches.incubator.apache.org
- Source Control
+ === Source Control ===
  The DataSketches team currently uses Git and would like to continue to do so. We request
a Git repository for DataSketches with mirroring to GitHub enabled, similar to the following:

  
- https://gitbox.apache.org/repos/asf/incubator-datasketches.git
+  * https://gitbox.apache.org/repos/asf/incubator-datasketches.git
- https://github.com/apache/incubator-datasketches.git
+  * https://github.com/apache/incubator-datasketches.git
- Issue Tracking
+ === Issue Tracking ===
  We request the creation of an Apache-hosted JIRA. The DataSketches project is currently
using the public GitHub issue tracker and the public Google Groups forum/sketches-user for
issue tracking and discussions. We will migrate and combine the content of these two sources into the
Apache JIRA. 
  
  Proposed Jira ID: DATASKETCHES
+ 
- Initial Committers
+ == Initial Committers ==
  The following individuals have been extremely active in our community and should
have write (commit) permissions to the repository.  
  
- Eshcar Hillel              	[eshcar at verizonmedia dot com]
+  * Eshcar Hillel              	[eshcar at verizonmedia dot com]
- Kevin Lang            	[langk at verizonmedia dot com]
+  * Kevin Lang            	[langk at verizonmedia dot com]
- Roman Leventov      	[leventov at apache dot org]
+  * Roman Leventov      	[leventov at apache dot org]
- Edo Liberty           	[libertye at amazon dot com]
+  * Edo Liberty           	[libertye at amazon dot com]
- Jon Malkin            	[jmalkin at verizonmedia dot com]
+  * Jon Malkin            	[jmalkin at verizonmedia dot com]
- Lee Rhodes          	[lrhodes at verizonmedia dot com] & [leerho at gmail dot com]
+  * Lee Rhodes          	[lrhodes at verizonmedia dot com] & [leerho at gmail dot com]
- Alexander Saydakov 	[saydakov at verizonmedia dot com]
+  * Alexander Saydakov 	[saydakov at verizonmedia dot com]
- Justin Thaler         	[justin.thaler at georgetown dot edu]
+  * Justin Thaler         	[justin.thaler at georgetown dot edu]
- Affiliations
+ == Affiliations ==
  The initial committers are from four organizations: Yahoo, Amazon, Georgetown University,
and Metamarkets/Snap.
- Champion
+ == Champion ==
  Jean-Baptiste Onofré, [jb at nanthrax dot net]
- Nominated Mentors
+ == Nominated Mentors ==
  Liang Chen, [chenliang613 at apache dot org] 
  Kenneth Knowles, [kenn at apache dot org]
  Furkan Kamaci, [kamaci at apache dot org]
- Sponsoring Entity
+ == Sponsoring Entity ==
- The Apache Incubator    **** This is our 1st choice ****
+  * The Apache Incubator    **** This is our 1st choice ****
- Apache Druid. The incubating Apache Druid project might also be a logical sponsor. However,
DataSketches has applications in many areas of computing outside of Druid so our preference
and recommendation is that DataSketches would ultimately be a top-level Apache project.
+  * Apache Druid. The incubating Apache Druid project might also be a logical sponsor. However,
DataSketches has applications in many areas of computing outside of Druid so our preference
and recommendation is that DataSketches would ultimately be a top-level Apache project.
- Appendix
+ == Appendix ==
- Academic Classes on Streaming and Sketching Algorithms
+ === Academic Classes on Streaming and Sketching Algorithms ===
  We have identified only a few universities in the U.S. that offer classes devoted to streaming
and sketching algorithms at the advanced graduate level. These courses exist because these
topics are not covered in any standard undergraduate or graduate algorithms classes.
  
  These include:
- Amit Chakrabarti's course at Dartmouth (Data Stream Algorithms, last offered 2015): https://www.cs.dartmouth.edu/~ac/Teach/CS35-Fall15/
+  * Amit Chakrabarti's course at Dartmouth (Data Stream Algorithms, last offered 2015): https://www.cs.dartmouth.edu/~ac/Teach/CS35-Fall15/
- Andrew McGregor's course at UMass Amherst (https://people.cs.umass.edu/~mcgregor/courses/CS711S18/index.html).
This course is sometimes called "More Advanced Algorithms" and sometimes "Data Stream
Algorithms".
+  * Andrew McGregor's course at UMass Amherst (https://people.cs.umass.edu/~mcgregor/courses/CS711S18/index.html).
This course is sometimes called "More Advanced Algorithms" and sometimes "Data Stream
Algorithms".
- Justin Thaler’s course at Georgetown, called "Streaming Algorithms" http://people.cs.georgetown.edu/jthaler/COSC548.html.
+  * Justin Thaler’s course at Georgetown, called "Streaming Algorithms" http://people.cs.georgetown.edu/jthaler/COSC548.html.
- Paul Beame's course at University of Washington, called Sublinear (and Streaming) Algorithms:
https://courses.cs.washington.edu/courses/cse522/14sp/lectures/index.html.
+  * Paul Beame's course at University of Washington, called Sublinear (and Streaming) Algorithms:
https://courses.cs.washington.edu/courses/cse522/14sp/lectures/index.html.
- Jelani Nelson's course at Harvard, called "Sketching Algorithms for Big Data": https://www.sketchingbigdata.org/fall17/.
+  * Jelani Nelson's course at Harvard, called "Sketching Algorithms for Big Data": https://www.sketchingbigdata.org/fall17/.
  
  All of the courses above are taught by theorists and, with the possible exception of Professor
Thaler’s course, do not cover many of the algorithms or concepts most central to the
DataSketches library. 
  

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org

