incubator-cvs mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "DRATProposal" by ChrisMattmann
Date Wed, 02 Aug 2017 16:48:46 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "DRATProposal" page has been changed by ChrisMattmann:
https://wiki.apache.org/incubator/DRATProposal?action=diff&rev1=16&rev2=17

  As a part of the Apache Software Foundation (ASF) project, Apache Creadur, a Release Audit
Tool (RAT) was developed especially in response to demand from the Apache Software Foundation
and its hundreds of projects to provide a capability for release auditing that could be integrated
into projects. The primary function of the RAT is automated code auditing and open-source
license analysis focusing on headers. RAT is a natural language processing tool written in
Java to easily run on any platform and to audit code from many source languages (e.g., C,
C++, Java, Python, etc.). RAT can also be used to add license headers to code that is not
licensed.
  
  In the summer of 2013, our team ran Apache RAT on source code produced from the Defense
Advanced Research Projects Agency (DARPA) XDATA national initiative whose inception coincided
with the 2012 U.S. Presidential Initiative in Big Data. XDATA brought together 24 performers
across academia, private industry and the government to construct analytics, visualizations,
and open source software mash-ups that were transitioned into government projects and to the
defense sector. XDATA produced a large Git repository consisting of ~50,000 files and 10s
of millions of lines of code. DARPA XDATA was launched to build a useful infrastructure for
many government agencies and ultimately is an effort to avoid the traditional government-contractor
software pipeline in which additional contracts are required to reuse and to unlock software
previously funded by the government in other programs.
- All XDATA software is open source and is ingested into DARPA’s Open Catalog [6] that points
to outputs of the program including its source code and metrics on the repository. Because
of this, one of core products of XDATA is the internal Git repository. Since XDATA brought
together open source software across multiple performers, having an understanding of the licenses
that the source codes used, and their compatibilities and differences was extremely important
and since there repository was so large, our strategy was to develop an automated process
using Apache RAT.
+ All XDATA software is open source and is ingested into [[https://opencatalog.darpa.mil/|DARPA’s
Open Catalog]] that points to outputs of the program including its source code and metrics
on the repository. Because of this, one of the core products of XDATA is the internal Git repository.
Since XDATA brought together open source software across multiple performers, understanding
the licenses that the source code used, and their compatibilities and differences, was
extremely important, and since the repository was so large, our strategy was to develop an
automated process using Apache RAT.
  We ran RAT on a 24-core, 48 GB RAM Linux machine at the National Aeronautics and Space Administration
(NASA)’s Jet Propulsion Laboratory (JPL) to produce a license evaluation of the XDATA Git
repository and to provide recommendations on how the open source software products can be
combined to adhere to the XDATA open source policy encouraging permissive licenses. Against
our expectations, however, RAT failed to successfully and quickly audit XDATA’s large Git
repository. Moreover, RAT provided no incremental output, producing only a final report
when a task was completed. RAT’s crawler did not automatically discern between binary file
types and other file types. RAT seemed to perform better when we collected similar sets
of files together (e.g., all JavaScript, all C++, all Java) and then ran RAT jobs individually
by file type on smaller increments of files (e.g., 100 Java files at a time, etc.).
- The lessons learned navigating these issues have motivated to create “DRAT”, which stands
for "Distributed Release Audit Tool". DRAT directly overcomes RAT's limitations and brings
code auditing and open source license analysis into the realm of Big Data using scalable open
source Apache technologies. DRAT is already being applied and transitioned into the government
agencies. DRAT currently exists at Github under the ALv2
+ The lessons learned navigating these issues have motivated us to create “DRAT”, which stands
for "Distributed Release Audit Tool". DRAT directly overcomes RAT's limitations and brings
code auditing and open source license analysis into the realm of Big Data using scalable open
source Apache technologies. DRAT is already being applied and transitioned into
government agencies. DRAT currently exists on GitHub under the ALv2 in Chris Mattmann's GitHub account.
Chris Mattmann was the PI of DARPA XDATA at JPL.
  
  == Current Status ==
  
@@ -43, +43 @@

  Mention JPL folks
  Tyler is at Google
  Karanjeet formerly of USC + JPL and now Apple
+ former USC students
  
  
  == Alignment ==
  
+ TBD
- Apache is, by far, the most natural home for taking the AsterixDB
- project forward. A large fraction of today's top Big Data
- technologies have their homes in Apache, including Hadoop, YARN, Pig,
- Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
- significant gap -- the parallel data management system gap -- that
- exists in the Big Data open source world. It is well-aligned with a
- number of the Apache projects, e.g., it has strong support for
- accessing and indexing external data in HDFS, and it uses YARN as an
- answer to basic cluster resource management. AsterixDB also seeks to
- achieve an Apache-style development model; it is seeking a broader
- community of contributors and users in order to achieve its full
- potential and value to the Big Data community.
- 
- There are also a number of related Apache projects and dependencies
- that will be mentioned below in the Relationships with Other Apache
- products section.
  
  
  == Known Risks ==
  
  === Orphaned products ===
  
+ JPL making a commitment to run DRAT on our internal code repos
+ TBD
- Given the current level of intellectual investment in AsterixDB, the
- risk of the project being abandoned is very small. The UCI/UCR
- faculty team leads are highly incentivized to continue development
- since the database groups at UC Irvine and UC Riverside are both
- reliant on AsterixDB as a platform for long-term graduate research
- projects. UC San Diego is also beginning to contribute to the code
- base, and a collaboration involving public health applications is
- forming with UCLA. The work on AsterixDB is managed via a mix of
- mailing list discussions supplemented by weekly project status
- meetings which are summarized on the mailing list. Typical (local
- plus Skype-in) attendance to the weekly status meetings runs at about
- 20 active contributors.
  
  === Inexperience with Open Source ===
  
+ TBD
- AsterixDB and Hyracks were completely developed in Open Source under
- the ALv2. The source code repositories, issue tracker, and mailing
- lists are available on Google Code and discussions and decisions
- happen on the mailing lists (which is necessary due to the geographic
- distribution of the current developers).
- 
- Also a few of the initial committers have contributed to Apache
- projects. Vinayak Borkar is a committer on the Apache Helix and
- Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
- and an IPMC member. Preston Carman and Steven Jacobs are committers
- on the Apache VXQuery project.
  
  
  === Relationships with Other Apache Products ===
  
+ RAT, OODT, Tika, Lucene, Solr, Wicket
- Apache VXQuery is based on the Hyracks data-parallel runtime, which
- is also included in the AsterixDB code base.
- 
- AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
- is support for accessing external data in HDFS (and Hive formats),
- and resource management and system administration features are in the
- process of being migrated to YARN.
- 
- AsterixDB's AQL query facilities offer comparable query power to
- Apache's Pig and Hive systems for big data analytics. AsterixDB
- differs in storing and indexing data and thus being able to quickly
- answer small and medium queries without large HDFS data scans -
- thereby targeting a different class of use cases.
- 
- AsterixDB's data storage and indexing facilities are similar to those
- of HBase, but AsterixDB differs in being a much more complete and
- queryable BDMS (not just a key-value style store).
- 
- AsterixDB's target use cases are not in-memory processing or
- iterative algorithm support, making AsterixDB complementary to the
- Apache Spark platform. (Spark interoperability is on our longer-term
- to-do wishlist.)
- 
  
  === Homogeneous Developers ===
  
+ TBD
- As mentioned before the current community is already organizationally
- and geographically distributed - and we would like to increase the
- heterogeneity.
  
  
  === Reliance on Salaried Developers ===
  
+ TBD
- Of the initial committers only 3 are full-time UCI staff. The other
- committers are a mix of students, alumni who continue to contribute
- to the effort, and individuals working with permission part-time (or
- in spare time) on this project.
  
  
  === An Excessive Fascination with the Apache Brand ===
  
+ TBD
- We believe in the processes, systems, and framework Apache has put in
- place. Apache is also known to foster a great community around their
- projects and provide exposure. While brand is important, our
- fascination with it is not excessive. We believe that the ASF is the
- right home for AsterixDB and that having AsterixDB inside of the ASF
- will lead to a better long-term outcome for the Big Data community.
  
  
  === Documentation ===
  
+ Documentation including code, a wiki, and publications surrounding DRAT can be found at
http://github.com/chrismattmann/drat/.
- Documentation and publications related to AsterixDB can be found at
- http://asterixdb.ics.uci.edu/.
  
  
  === Initial Source ===
  
+ Documentation including code, a wiki, and publications surrounding DRAT can be found at
http://github.com/chrismattmann/drat/
- Current source resides in Google code:
- https://code.google.com/p/asterixdb/ (query language and upper system
- layers) and https://code.google.com/p/hyracks/ (dataflow runtime
- system and storage management libraries).
  
  
  === External Dependencies ===
  
  DRAT depends on a number of Apache projects:
  
+  * OODT
+  * Lucene
+  * RAT
+  * Solr
+  * Tika
-  * Ant
-  * Avro
-  * ApacheDB JDO
-  * Commons
-  * Derby
-  * Hadoop
-  * Hive
-  * HTTPComponents
-  * Jakarta ORO
-  * Maven
-  * Tomcat
-  * Thrift
-  * Velocity
   * Wicket
-  * Xerces
  
  and other open source projects (organized by license):
  
-  * ALv2:
-   * Jackson
-   * Google Guava
-   * Google Guice
-   * JSON-simple
-   * BoneCP
-   * Microsoft Azure SDK
-   * Netty
-   * Rome
-   * !JetS3t
-   * Groovy
-   * Jettison
-   * Plexus
-   * Datanucleus (JDO)
-   * Jetty
-   * Twitter4J
-   * Snappy-java
- 
-  * BSD:
-   * Antlr
-   * !ObjectWeb ASM
-   * Protobuf
-   * JSCH
-   * JavaCC
-   * Paranamer
-   * JLine
-   * Stax
-   * !StringTemplate
-   * xmlEnc
- 
-  * MIT
-   * !AppAssembler
-   * SimpleLog4J
- 
-  * CDDL 1.0
-   * Java Activation Framework
-   * Java Transactions
-   * Java Servlet API
-   * Grizzly
-   * gmbal
-   * Glassfish
- 
-  * CDDL 1.1
-   * Jersey
-   * JAXB Reference Implementation
- 
-  * JSON License
-   * JSON
- 
-  * EPL 1.0
-   * JUnit
- 
-  * JDOM License
-   * JDOM
- 
-  * Public Domain
-   * xz
-   * AOPAlliance
  
  As all dependencies are managed using Apache Maven, none of the
  external libraries need to be packaged in a source distribution.
@@ -251, +114 @@

  
  === Developer and user mailing lists ===
  
-  * private@asterixdb.incubator.apache.org (with moderated subscriptions)
+  * private@drat.incubator.apache.org (with moderated subscriptions)
-  * commits@asterixdb.incubator.apache.org
+  * commits@drat.incubator.apache.org
-  * dev@asterixdb.incubator.apache.org
+  * dev@drat.incubator.apache.org
-  * users@asterixdb.incubator.apache.org
  
  
- A git repository
+ A gitbox repository at:
  
- https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
+ https://github.com/apache/drat.git
  
+ Issue tracking
  
+ We will use the GitHub issue tracker.
- A JIRA issue tracker
- 
- https://issues.apache.org/jira/browse/ASTERIXDB
  
  
  == Initial Committers ==
@@ -273, +134 @@

  active subset of the committers for the current repository at Google
  code).
  
+  * Chris Mattmann
+  * Tyler Palsulich
+  * Paul Ramirez
+  * Lewis John McGibbney
+  * Karanjeet Singh
+  * Steven Francus
+  * Michael Joyce
-  * Abdullah Alamoudi (bamousaa@gmail.com)
-  * Cameron Samak (eufery@gmail.com)
-  * Chen Li (chenli@gmail.com)
-  * Ian Maxon (imaxon@uci.edu)
-  * Inci Cetindil (icetindil@gmail.com)
-  * Ildar Absalyamov (ildar.absalyamov@gmail.com)
-  * Jianfeng Jia (jianfeng.jia@gmail.com)
-  * Keren Ouaknine (kereno@gmail.com)
-  * Markus Dreseler (apache@dreseler.de)
-  * Mike Carey (dtabass@apache.org)
-  * Murtadha Hubail (hubailmor@gmail.com)
-  * Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
-  * Preston Carman (prestonc@apache.org)
-  * Raman Grover (ramangrover29@gmail.com)
-  * Sattam Alsubaiee (salsubaiee@gmail.com)
-  * Steven Jacobs (sjaco002@apache.org)
-  * Taewoo Kim (wangsaeu@gmail.com)
-  * Till Westmann (tillw@apache.org)
-  * Vassilis Tsotras (tsotras@cs.ucr.edu)
-  * Vinayak Borkar (vinayakb@apache.org)
-  * Yingyi Bu (buyingyi@gmail.com)
-  * Young-Seok Kim (kisskys@gmail.com)
-  * Zach Heilbron (zheilbron@gmail.com)
  
  
  == Affiliations ==
  
+ NASA JPL
+  * Chris Mattmann
+  * Paul Ramirez
+  * Lewis John McGibbney
+  * Michael Joyce
- UC Irvine
-  * Mike Carey
-  * Chen Li
-  * Ian Maxon
-  * Inci Cetindil
-  * Yingyi Bu
-  * Raman Grover
-  * Pouria Pirzadeh
-  * Young-Seok Kim
-  * Cameron Samak
-  * Taewoo Kim
-  * Jianfeng Jia
-  * Murtadha Hubail
-  * Markus Dreseler
  
+ Apple
+  * Karanjeet Singh
- UC Riverside
-  * Ildar Absalyamov
-  * Preston Carman
-  * Steven Jacobs
-  * Vassilis Tsotras
  
- Hebrew University
-  * Keren Ouaknine
+ Google
+  * Tyler Palsulich
  
- Oracle
-  * Till Westmann
+ Chronaly
+  * Steven Francus
  
- X15 Software
-  * Vinayak Borkar
-  * Zach Heilbron
- 
- KACST Saudi Arabia
-  * Sattam Alsubaiee
- 
- Saudi Aramco
-  * Abdullah Alamoudi
- 
- Carey, Li, and Maxon are full-time UCI (UC Irvine) staff, Tsotras is
- full-time UCR (UC Riverside) staff, with the remaining UCI and UCR
- affiliates being students. The non-UC committers are a mix of alumni
- who continue to contribute to the effort and individuals working
- with permission part-time (or in spare time) on this project.
  
  
  == Sponsors ==
@@ -352, +170 @@

  
  === Nominated Mentors ===
  
-  * Henry Saputra
-  * Jochen Wiedmann
-  * Ted Dunning
-  * Ate Douma
+  * Chris Mattmann
+  * Paul Ramirez
+  * Lewis John McGibbney
+ [others]
  
  === Sponsoring Entity ===
  
- The Apache Incubator
+ The Apache Board (pTLP) or...the Apache Incubator.
  

---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@incubator.apache.org
For additional commands, e-mail: cvs-help@incubator.apache.org

