From Apache Wiki <wikidi...@apache.org>
Subject [Incubator Wiki] Update of "JoshuaProposal" by ChrisMattmann
Date Wed, 13 Jan 2016 06:55:15 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "JoshuaProposal" page has been changed by ChrisMattmann:
https://wiki.apache.org/incubator/JoshuaProposal?action=diff&rev1=10&rev2=11

+ = Joshua Proposal =
- ## page was copied from AccumuloProposal
- = Accumulo Proposal =
  
  == Abstract ==
- Accumulo is a distributed key/value store that provides expressive, cell-level access labels.
+ [[http://joshua-decoder.org|Joshua]] is an open-source statistical machine translation toolkit.
It includes a Java-based decoder for translating with phrase-based, hierarchical, and syntax-based
translation models, a Hadoop-based grammar extractor (Thrax), and an extensive set of tools
and scripts for training and evaluating new models from parallel text.
  
  == Proposal ==
- Accumulo is a sorted, distributed key/value store based on Google's BigTable design.  It
is built on top of Apache Hadoop, Zookeeper, and Thrift.  It features a few novel improvements
on the BigTable design in the form of cell-level access labels and a server-side programming
mechanism that can modify key/value pairs at various points in the data management process.
+ Joshua is a state-of-the-art statistical machine translation system that provides a number of features:
  
- == Background ==
- Google published the design of BigTable in 2006.  Several other open source projects have
implemented aspects of this design including HBase, CloudStore, and Cassandra.  Accumulo began
its development in 2008.
+  * Support for the two main paradigms in statistical machine translation: phrase-based and hierarchical/syntactic.
+  * A sparse feature API that makes it easy to add new feature templates supporting millions
of features
+  * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
+  * Support for lattice decoding, allowing upstream NLP tools to expose their hypothesis
space to the MT system
+  * An efficient representation for models, allowing for quick loading of multi-gigabyte
model files
+  * Fast decoding speed (on par with Moses and mtplz)
+  * Language packs — precompiled models that allow the decoder to be run as a black box (see the sketch after this list)
+  * Thrax, a Hadoop-based tool for learning translation models from parallel text
+  * A suite of tools for constructing new models for any language pair for which sufficient
training data exists
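+ 
+ To make the “black box” idea above concrete, the sketch below pipes a single sentence through a decoder process from Java. The launch script path, the example sentence, and the assumption that the decoder reads source sentences on stdin and writes translations on stdout are illustrative assumptions, not a documented Joshua interface.

{{{
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/**
 * Minimal sketch of "decoder as a black box": start an external decoder
 * process (here, a hypothetical language-pack launch script) and pipe one
 * sentence through it.
 */
public class BlackBoxDecoderExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to a language pack's launch script.
        Process decoder = new ProcessBuilder("/path/to/language-pack/run-joshua.sh").start();

        // Write one source sentence to the decoder's stdin, then close it (EOF).
        try (Writer in = new OutputStreamWriter(decoder.getOutputStream(), StandardCharsets.UTF_8)) {
            in.write("ich möchte ein bier\n");
        }

        // Read the translation back from the decoder's stdout.
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(decoder.getInputStream(), StandardCharsets.UTF_8))) {
            System.out.println(out.readLine());
        }

        decoder.waitFor();
    }
}
}}}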
  
- == Rationale ==
- There is a need for a flexible, high performance distributed key/value store that provides
expressive, fine-grained access labels.  The communities we expect to be most interested in
such a project are government, health care, and other industries where privacy is a concern.
 We have made much progress in developing this project over the past 3 years and believe both
the project and the interested communities would benefit from this work being openly available
and having open development.
+ == Background and Rationale ==
+ A number of factors make this a good time for an Apache project focused on machine translation (MT): the quality of MT output for many language pairs; the computing resources available on ordinary computers, relative to the needs of MT systems; and the availability of a number of high-quality toolkits, together with a large base of researchers working on them.
+ 
+ Over the past decade, machine translation (MT; the automatic translation of one human language to another) has become a reality. The research into statistical approaches to translation that began in the early nineties, together with the availability of large amounts of training data and better computing infrastructure, has produced translation results that are “good enough” for a large set of language pairs and use cases. Free services like [[https://www.bing.com/translator|Bing Translator]] and [[https://translate.google.com|Google Translate]] have made machine translation available to the average person through direct interfaces and through tools like browser plugins, and sites across the world with higher translation needs use them to translate their pages automatically.
+ 
+ MT does not require the infrastructure of large corporations in order to produce usable output. Machine translation can be resource-intensive, but it need not be prohibitively so. Disk and memory usage are mostly a matter of model size, which for most language pairs is a few gigabytes at most; at that size, models can provide coverage on the order of tens or even hundreds of thousands of words in the input and output languages. The computational complexity of the algorithms used to search for translations of new sentences is typically linear in the number of words in the input sentence, making it possible to run a translation engine on a personal computer.
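+ 
+ A rough way to state the linearity claim (the symbols below are illustrative, not taken from Joshua's documentation): if the decoder keeps at most a fixed number of hypotheses per position (a beam of size b) and reordering or span length is bounded, the work per input word is bounded by a constant, so total decoding time for a sentence of n words is approximately

{{{
T_{\mathrm{decode}}(n) \;\approx\; c \cdot b \cdot n \;=\; O(n)
\qquad \text{for fixed beam size } b \text{ and some constant } c.
}}}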
+ 
+ The research community has produced many different open source translation projects, written in a range of programming languages and released under a variety of licenses. These projects include the core “decoder”, which takes a model and uses it to translate new sentences between the language pair the model was defined for. They also typically include a large set of tools that enable new models to be built from large sets of example translations (“parallel data”) and monolingual texts. These toolkits are usually built to support the agendas of the (largely) academic researchers who build them: the repeated cycle of building new models, tuning model parameters against development data, and evaluating them against held-out test data, using standard metrics for testing the quality of MT output.
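+ 
+ The “standard metrics” mentioned above are automatic scores computed against held-out reference translations; BLEU is the most widely used. The snippet below is a deliberately simplified, sentence-level illustration of a BLEU-style score (clipped n-gram precision with a brevity penalty); it is not Joshua's implementation, and real evaluation is done at the corpus level.

{{{
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/**
 * Simplified, sentence-level BLEU-style score: a geometric mean of clipped
 * n-gram precisions (n = 1..4) multiplied by a brevity penalty.
 */
public class SimpleBleu {

    private static Map<String, Integer> ngramCounts(String[] tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static double score(String hypothesis, String reference) {
        String[] hyp = hypothesis.trim().split("\\s+");
        String[] ref = reference.trim().split("\\s+");
        int maxOrder = 4;
        double logPrecisionSum = 0.0;
        for (int n = 1; n <= maxOrder; n++) {
            Map<String, Integer> hypCounts = ngramCounts(hyp, n);
            Map<String, Integer> refCounts = ngramCounts(ref, n);
            int matches = 0, total = 0;
            for (Map.Entry<String, Integer> e : hypCounts.entrySet()) {
                total += e.getValue();
                // "Clipping": an n-gram only counts as often as it appears in the reference.
                matches += Math.min(e.getValue(), refCounts.getOrDefault(e.getKey(), 0));
            }
            // Add-one smoothing keeps the geometric mean defined for short sentences.
            logPrecisionSum += Math.log((matches + 1.0) / (total + 1.0));
        }
        // Brevity penalty discourages hypotheses shorter than the reference.
        double brevityPenalty = hyp.length >= ref.length
                ? 1.0
                : Math.exp(1.0 - (double) ref.length / hyp.length);
        return brevityPenalty * Math.exp(logPrecisionSum / maxOrder);
    }

    public static void main(String[] args) {
        System.out.printf("BLEU-style score: %.3f%n",
                score("the cat sat on the mat", "the cat sat on a mat"));
    }
}
}}}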
+ 
+ Together, these three factors—the quality of machine translation output, the feasibility of translating on standard computers, and the availability of tools to build models—make it reasonable for end users to use MT as a black-box service and to run it on their personal machines.
+ 
+ These factors make it a good time for an organization with the status of the Apache Software Foundation to host a machine translation project.
  
  == Current Status ==
+ Joshua was originally ported from David Chiang’s Python implementation of Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins University. The current version is maintained by Matt Post at Johns Hopkins’ Human Language Technology Center of Excellence. Joshua has had many releases, with over 20 source code tags. The most recent release was 6.0.5, on November 5, 2015.
  
- === Meritocracy ===
+ == Meritocracy ==
- We intend to strongly encourage the community to help with and contribute to the code. 
We will actively seek potential committers and help them become familiar with the codebase.
+ The current developers are familiar with meritocratic open source development at Apache.
Apache was chosen specifically because we want to encourage this style of development for
the project.
  
- === Community ===
+ == Community ==
- A strong government community has developed around Accumulo and training classes have been
ongoing for about a year.  Hundreds of developers use Accumulo.
+ Joshua is used widely across the world. Perhaps its biggest known research and industrial user is the Amazon research group in Berlin. Another user is the US Army Research Lab. No formal census has been undertaken, but posts to the Joshua technical support mailing list, along with occasional contributions, suggest small research and academic communities spread around the globe, many of them in India.
  
+ During incubation, we will explicitly seek to increase adoption across the board, including academic researchers, industry, and other end users interested in statistical machine translation.
- === Core Developers ===
- The developers are mainly employed by the National Security Agency, but we anticipate interest
developing among other companies.
  
+ == Core Developers ==
+ The current set of core developers is fairly small, having shrunk as core student participants graduated from Johns Hopkins. However, Joshua is used fairly widely, as mentioned above, and the principal researcher at Johns Hopkins remains committed to using and developing it. Joshua has also attracted a number of new community members recently, owing to its projected use in ongoing DARPA projects such as XDATA and Memex.
+ 
- === Alignment ===
+ == Alignment ==
- Accumulo is built on top of Hadoop, Zookeeper, and Thrift.  It builds with Maven.  Due to
the strong relationship with these Apache projects, the incubator is a good match for Accumulo.
+ Joshua is currently Copyright (c) 2015, Johns Hopkins University, all rights reserved, and is licensed under the BSD 2-clause license. We intend to relicense the code under the Apache License 2.0, which would permit expanded use of the software within Apache projects. There is an ongoing effort within the Apache Tika community to use Joshua in Tika’s Translate API; see [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
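+ 
+ As a concrete illustration of that integration direction, the sketch below shows how a Joshua-backed translator might plug into Tika's org.apache.tika.language.translate.Translator interface. The class name, the assumption of a locally running Joshua decoder in server mode that answers one translated line per input line over a plain TCP socket, and the port number are all hypothetical.

{{{
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.language.translate.Translator;

/**
 * Hypothetical sketch of a Joshua-backed Tika Translator (cf. TIKA-1343).
 * It assumes a Joshua decoder is already running in server mode on
 * localhost:5674 and returns one translated line per input line; the host,
 * port, and wire protocol are assumptions for illustration only.
 */
public class JoshuaTranslatorSketch implements Translator {

    private static final String HOST = "localhost";
    private static final int PORT = 5674; // assumed port of a running decoder server

    @Override
    public String translate(String text, String sourceLanguage, String targetLanguage)
            throws TikaException, IOException {
        // A Joshua language pack is built for one fixed language pair, so this
        // sketch assumes the running server already matches the requested pair.
        try (Socket socket = new Socket(HOST, PORT);
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
            out.println(text);
            return in.readLine();
        }
    }

    @Override
    public String translate(String text, String targetLanguage)
            throws TikaException, IOException {
        return translate(text, null, targetLanguage);
    }

    @Override
    public boolean isAvailable() {
        try (Socket probe = new Socket(HOST, PORT)) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
}}}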
  
  == Known Risks ==
+ 
- === Orphaned Products ===
+ === Orphaned Products ===
- There is only a small risk of being orphaned.  The community is committed to improving the
codebase of the project due to its fulfilling needs not addressed by any other software.
+ At the moment, regular contributions are made by a single contributor, the lead maintainer (Matt Post). He plans to continue development for the next few years, but he remains a single point of failure, since the graduate students who worked on the project have moved on to jobs, mostly in industry. However, our goal is to mitigate this risk by growing the community within Apache, beginning with users and participants from NASA JPL.
  
  === Inexperience with Open Source ===
- The codebase has been treated internally as an open source project since its beginning,
and the initial Apache committers have been involved with the code for multiple years.  While
our experience with public open source is limited, we do not anticipate difficulty in operating
under Apache's development process.
+ The teams at both Johns Hopkins and NASA JPL have experience with many open source projects, at Apache and elsewhere. We understand "how it works" at the foundation.
  
- === Homogeneous Developers ===
- The committers have multiple employers and it is expected that committers from different
companies will be recruited.
  
- === Reliance on Salaried Developers ===
- The initial committers are all paid by their employers to work on Accumulo and we expect
such employment to continue.  Some of the initial committers would continue as volunteers
even if no longer employed to do so.
+ == Relationships with Other Apache Products ==
+ Joshua depends on Hadoop and is also used as a plugin in Apache Tika. We are further interested in coordinating with other projects, including Spark, and with any project needing MT services for language translation.
  
+ == Developers ==
+ Joshua has only one regular developer, who is employed by Johns Hopkins University. NASA JPL (Mattmann and McGibbney) has been contributing lately, including a Homebrew formula and other contributions to the project, through the DARPA XDATA and Memex programs.
- === Relationships with Other Apache Products ===
- Accumulo uses Hadoop, Zookeeper, Thrift, Maven, log4j, commons-lang, -net, -io, -jci, -collections,
-configuration, -logging, and -codec.
- 
- === Relationship to HBase ===
- Accumulo and HBase are both based on the design of Google's BigTable, so there is a danger
that potential users will have difficulty distinguishing the two.  Some of the key areas in
which Accumulo differs from HBase are discussed below.  It may be possible to incorporate
the desired features of Accumulo into HBase.  However, the amount of work required would slow
development of HBase and Accumulo considerably.  We believe this warrants a podling for Accumulo
at the current time.  We expect active cross-pollination will occur between HBase and podling
Accumulo and it is possible that the codebases and projects will ultimately converge.
- 
- ==== Access Labels ====
- Accumulo has an additional portion of its key that sorts after the column qualifier and
before the timestamp.  It is called column visibility and enables expressive cell-level access
control.  Authorizations are passed with each query to control what data is returned to the
user.  The column visibilities are boolean AND and OR combinations of arbitrary strings (such
as "(A&B)|C") and authorizations are sets of strings (such as {C,D}).
- 
- ==== Iterators ====
- Accumulo has a novel server-side programming mechanism that can modify the data written
to disk or returned to the user.  This mechanism can be configured for any of the scopes where
data is read from or written to disk.  It can be used to perform joins on data within a single
tablet.
- 
- ==== Flexibility ====
- HBase requires the user to specify the set of column families to be used up front.  Accumulo
places no restrictions on the column families.  Also, each column family in HBase is stored
separately on disk.  Accumulo allows column families to be grouped together on disk, as does
BigTable.  This enables users to configure how their data is stored, potentially providing
improvements in compression and lookup speeds.  It gives Accumulo a row/column hybrid nature,
while HBase is currently column-oriented.
- 
- ==== Testing ====
- Accumulo has testing frameworks that have resulted in its achieving a high level of correctness
and performance.  We have observed that under some configurations and conditions Accumulo
will outperform HBase and provide greater data integrity.
- 
- ==== Logging ====
- HBase uses a write-ahead log on the Hadoop Distributed File System.  Accumulo has its own
logging service that does not depend on communication with the HDFS NameNode.
- 
- ==== Storage ====
- Accumulo has a relative key file format that improves compression.
- 
- ==== Areas in which HBase features improvements over Accumulo ====
- in memory tables, upserts, coprocessors, connections to other projects such as Cascading
and Pig
- 
- === Expectations ===
- There is a risk that Accumulo will be criticized for not providing adequate security.  The
access labels in Accumulo do not in themselves provide a complete security solution, but are
a mechanism for labeling each piece of data with the authorizations that are necessary to
see it.
- 
- === Apache Brand ===
- Our interest in releasing this code as an Apache incubator project is due to its strong
relationship with other Apache projects, i.e. Accumulo has dependencies on Hadoop, Zookeeper,
and Thrift and has complementary goals to HBase.
  
  == Documentation ==
- There is not currently documentation about Accumulo on the web, but a fair amount of documentation
and training materials exists and will be provided on the Accumulo wiki at apache.org.  Also,
a paper discussing YCSB results for Accumulo will be presented at the 2011 Symposium on Cloud
Computing.
+ Documentation and publications related to Joshua can be found at joshua-decoder.org. The source for the Joshua documentation is currently hosted on GitHub at https://github.com/joshua-decoder/joshua-decoder.github.com
  
  == Initial Source ==
+ The current source resides on GitHub: github.com/joshua-decoder/joshua (the main decoder and toolkit) and github.com/joshua-decoder/thrax (the grammar extraction tool).
- Accumulo has been in development since spring 2008.  There are hundreds of developers using
it and tens of developers have contributed to it.  The core codebase consists of 200,000 lines
of code (mainly Java) and 100s of pages of documentation.  There are also a few projects built
on top of Accumulo that may be added to its contrib in the future.  These include support
for Hive, Matlab, YCSB, and graph processing.
- 
- == Source and Intellectual Property Submission Plan ==
- Accumulo core code, examples, documention, and training materials will be submitted by the
National Security Agency.
- 
- We will also be soliciting contributions of further plugins from MIT Lincoln Labs, Carnegie
Mellon University, and others.
- 
- Accumulo has been developed by a mix of government employees and private companies under
government contract.  Material developed by government employees is in the public domain and
no U.S. copyright exists in works of the federal government.  For the contractor developed
material in the initial submission, the U.S. Government has sufficient authority per the ICLA
from the copyright owner to contribute the Accumulo code to the incubator.
- 
- There has been some discussion regarding accepting contributions from US Government sources
on https://issues.apache.org/jira/browse/LEGAL-93. We propose that the NSA will sign an ICLA/CCLA
if that document could be slightly modified to explicitly address copyright in works of government
employees. Specifically, we propose that the definition of “You” be modified to include
“the copyright owner, the owner of a Contribution not subject to copyright, or legal entity
authorized by the copyright owner that is making this Agreement.” In addition, section 2,
the copyright license grant be modified after “You hereby grant” that either states “to
the extent authorized by law” or “to the extent copyright exists in the Contribution.”
 These changes will permit US Government employee developed work to be included.
- 
- One proposed solution is to form a Collaborative Research and Development Agreement (CRADA)
between the Apache Software Foundation and the US Government, but this will not solve the
underlying problem that U.S. law does not grant copyright to works of government employees.
 At this time a CRADA is not necessary but should it be determined that a CRADA is necessary,
we would like to work through that process during the incubation phase of Accumulo rather
than before acceptance as this may take time to enter into an agreement.
  
  == External Dependencies ==
- jetty (Apache and EPL), jline (BSD), jfreechart (LGPL), jcommon (LGPL), slf4j (MIT), junit
(CPL)
+ Joshua has a number of external dependencies. Only BerkeleyLM (Apache 2.0) and KenLM (LGPL 2.1) are run-time decoder dependencies; one or the other is needed for translating sentences with pre-built models. The rest are dependencies of the build system and pipeline, used for constructing and training new models from parallel text.
  
- == Cryptography ==
- none
+ Apache projects:
+  * Ant
+  * Hadoop
+  * Commons
+  * Maven
+  * Ivy
+ 
+ There are also a number of other open-source projects, with various licenses, that the project depends on both dynamically (at runtime) and statically:
+ 
+ === GNU GPL 2 ===
+  * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
+ 
+ === LGPL 2.1 ===
+  * KenLM: github.com/kpu/kenlm
+ 
+ === Apache 2.0 ===
+  * BerkeleyLM: https://code.google.com/p/berkeleylm/
+ 
+ === GNU GPL ===
+  * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
  
  == Required Resources ==
   * Mailing Lists
+    * private@joshua.incubator.apache.org
+    * dev@joshua.incubator.apache.org
+    * commits@joshua.incubator.apache.org
-    * accumulo-private
-    * accumulo-dev
-    * accumulo-commits
-    * accumulo-user
  
-  * Subversion Directory
-    * https://svn.apache.org/repos/asf/incubator/accumulo
+  * Git Repos
+    * https://git-wip-us.apache.org/repos/asf/joshua.git
  
   * Issue Tracking
-    * JIRA Accumulo (ACCUMULO)
+    * JIRA Joshua (JOSHUA)
  
   * Continuous Integration
     * Jenkins builds on https://builds.apache.org/
  
   * Web
-    * http://incubator.apache.org/accumulo/
+    * http://joshua.incubator.apache.org/
-    * wiki at http://wiki.apache.org or http://cwiki.apache.org
+    * wiki at http://cwiki.apache.org
  
  == Initial Committers ==
+ The following is a list of the planned initial Apache committers (the active subset of the committers for the current repository on GitHub).
+ 
+  * Matt Post (post@cs.jhu.edu)
+  * Lewis John McGibbney (lewismc@apache.org) 
+  * Chris Mattmann (mattmann@apache.org) 
-  * Aaron Cordova (aaron at cordovas dot org)
-  * Adam Fuchs (adam.p.fuchs at ugov dot gov)
-  * Eric Newton (ecn at swcomplete dot com)
-  * Billie Rinaldi (billie.j.rinaldi at ugov dot gov)
-  * Keith Turner (keith.turner at ptech-llc dot com)
-  * John Vines (john.w.vines at ugov dot gov)
-  * Chris Waring (christopher.a.waring at ugov dot gov)
  
  == Affiliations ==
-  * Aaron Cordova, The Interllective
-  * Adam Fuchs, National Security Agency
-  * Eric Newton, SW Complete Incorporated
-  * Billie Rinaldi, National Security Agency
-  * Keith Turner, Peterson Technology LLC
-  * John Vines, National Security Agency
-  * Chris Waring, National Security Agency
+ 
+  * Johns Hopkins University
+    * Matt Post
+ 
+  * NASA JPL 
+    * Chris Mattmann
+    * Lewis John McGibbney
+ 
  
  == Sponsors ==
-  * Champion: Doug Cutting
+ === Champion ===
+  * Chris Mattmann (NASA/JPL)
  
- == Nominated Mentors ==
+ === Nominated Mentors ===
+  * Paul Ramirez
+  * Lewis John McGibbney
+  * Chris Mattmann
-  * Benson Margulies
-  * Alan Cabrera
-  * Bernd Fondermann
-  * Owen O'Malley
  
  == Sponsoring Entity ==
-  * Apache Incubator
+ The Apache Incubator
  
