incubator-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Danese Cooper <dan...@gmail.com>
Subject Re: [VOTE] Accept Joshua as an Apache Incubator Podling
Date Wed, 03 Feb 2016 15:43:11 GMT
+1 (binding) Accept Joshua as an Apache Incubator podling.

D

> On Jan 30, 2016, at 12:00 PM, Mattmann, Chris A (3980) <chris.a.mattmann@jpl.nasa.gov>
wrote:
> 
> Hi Everyone,
> 
> OK the discussion is now completed. Please VOTE to accept Joshua
> into the Apache Incubator. I’ll leave the VOTE open for at least
> the next 72 hours, with hopes to close it next Friday the 5th of
> February, 2016.
> 
> [ ] +1 Accept Joshua as an Apache Incubator podling.
> [ ] +0 Abstain.
> [ ] -1 Don’t accept Joshua as an Apache Incubator podling because..
> 
> Of course, I am +1 on this. Please note VOTEs from Incubator PMC
> members are binding but all are welcome to VOTE!
> 
> Cheers,
> Chris
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 
> 
> 
> -----Original Message-----
> From: jpluser <chris.a.mattmann@jpl.nasa.gov>
> Date: Tuesday, January 12, 2016 at 10:56 PM
> To: "general@incubator.apache.org" <general@incubator.apache.org>
> Cc: "post@cs.jhu.edu" <post@cs.jhu.edu>
> Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation
> Toolkit
> 
>> Hi Everyone,
>> 
>> Please find attached for your viewing pleasure a proposed new project,
>> Apache Joshua, a statistical machine translation toolkit. The proposal
>> is in wiki draft form at: https://wiki.apache.org/incubator/JoshuaProposal
>> 
>> Proposal text is copied below. I’ll leave the discussion open for a week
>> and we are interested in folks who would like to be initial committers
>> and mentors. Please discuss here on the thread.
>> 
>> Thanks!
>> 
>> Cheers,
>> Chris (Champion)
>> 
>> ———
>> 
>> = Joshua Proposal =
>> 
>> == Abstract ==
>> [[joshua-decoder.org|Joshua]] is an open-source statistical machine
>> translation toolkit. It includes a Java-based decoder for translating with
>> phrase-based, hierarchical, and syntax-based translation models, a
>> Hadoop-based grammar extractor (Thrax), and an extensive set of tools and
>> scripts for training and evaluating new models from parallel text.
>> 
>> == Proposal ==
>> Joshua is a state of the art statistical machine translation system that
>> provides a number of features:
>> 
>> * Support for the two main paradigms in statistical machine translation:
>> phrase-based and hierarchical / syntactic.
>> * A sparse feature API that makes it easy to add new feature templates
>> supporting millions of features
>> * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
>> * Support for lattice decoding, allowing upstream NLP tools to expose
>> their hypothesis space to the MT system
>> * An efficient representation for models, allowing for quick loading of
>> multi-gigabyte model files
>> * Fast decoding speed (on par with Moses and mtplz)
>> * Language packs — precompiled models that allow the decoder to be run as
>> a black box
>> * Thrax, a Hadoop-based tool for learning translation models from
>> parallel text
>> * A suite of tools for constructing new models for any language pair for
>> which sufficient training data exists
>> 
>> == Background and Rationale ==
>> A number of factors make this a good time for an Apache project focused on
>> machine translation (MT): the quality of MT output (for many language
>> pairs); the average computing resources available on computers, relative
>> to the needs of MT systems; and the availability of a number of
>> high-quality toolkits, together with a large base of researchers working
>> on them.
>> 
>> Over the past decade, machine translation (MT; the automatic translation
>> of one human language to another) has become a reality. The research into
>> statistical approaches to translation that began in the early nineties,
>> together with the availability of large amounts of training data, and
>> better computing infrastructure, have all come together to produce
>> translations results that are “good enough” for a large set of language
>> pairs and use cases. Free services like
>> [[https://www.bing.com/translator|Bing Translator]] and
>> [[https://translate.google.com|Google Translate]] have made these services
>> available to the average person through direct interfaces and through
>> tools like browser plugins, and sites across the world with higher
>> translation needs use them to translate their pages through automatically.
>> 
>> MT does not require the infrastructure of large corporations in order to
>> produce feasible output. Machine translation can be resource-intensive,
>> but need not be prohibitively so. Disk and memory usage are mostly a
>> matter of model size, which for most language pairs is a few gigabytes at
>> most, at which size models can provide coverage on the order of tens or
>> even hundreds of thousands of words in the input and output languages. The
>> computational complexity of the algorithms used to search for translations
>> of new sentences are typically linear in the number of words in the input
>> sentence, making it possible to run a translation engine on a personal
>> computer.
>> 
>> The research community has produced many different open source translation
>> projects for a range of programming languages and under a variety of
>> licenses. These projects include the core “decoder”, which takes a model
>> and uses it to translate new sentences between the language pair the model
>> was defined for. They also typically include a large set of tools that
>> enable new models to be built from large sets of example translations
>> (“parallel data”) and monolingual texts. These toolkits are usually built
>> to support the agendas of the (largely) academic researchers that build
>> them: the repeated cycle of building new models, tuning model parameters
>> against development data, and evaluating them against held-out test data,
>> using standard metrics for testing the quality of MT output.
>> 
>> Together, these three factors—the quality of machine translation output,
>> the feasibility of translating on standard computers, and the availability
>> of tools to build models—make it reasonable for the end users to use MT as
>> a black-box service, and to run it on their personal machine.
>> 
>> These factors make it a good time for an organization with the status of
>> the Apache Foundation to host a machine translation project.
>> 
>> == Current Status ==
>> Joshua was originally ported from David Chiang’s Python implementation of
>> Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
>> University. The current version is maintained by Matt Post at Johns
>> Hopkins’ Human Language Technology Center of Excellence. Joshua has made
>> many releases with a list of over 20 source code tags. The last release of
>> Joshua was 6.0.5 on November 5th, 2015.
>> 
>> == Meritocracy ==
>> The current developers are familiar with meritocratic open source
>> development at Apache. Apache was chosen specifically because we want to
>> encourage this style of development for the project.
>> 
>> == Community ==
>> Joshua is used widely across the world. Perhaps its biggest (known)
>> research / industrial user is the Amazon research group in Berlin. Another
>> user is the US Army Research Lab. No formal census has been undertaken,
>> but posts to the Joshua technical support mailing list, along with the
>> occasional contributions, suggest small research and academic communities
>> spread across the world, many of them in India.
>> 
>> During incubation, we will explicitly seek to increase our usage across
>> the board, including academic research, industry, and other end users
>> interested in statistical machine translation.
>> 
>> == Core Developers ==
>> The current set of core developers is fairly small, having fallen with the
>> graduation from Johns Hopkins of some core student participants. However,
>> Joshua is used fairly widely, as mentioned above, and there remains a
>> commitment from the principal researcher at Johns Hopkins to continue to
>> use and develop it. Joshua has seen a number of new community members
>> become interested recently due to a potential for its projected use in a
>> number of ongoing DARPA projects such as XDATA and Memex.
>> 
>> == Alignment ==
>> Joshua is currently Copyright (c) 2015, Johns Hopkins University All
>> rights reserved and licensed under BSD 2-clause license. It would of
>> course be the intention to relicense this code under AL2.0 which would
>> permit expanded and increased use of the software within Apache projects.
>> There is currently an ongoing effort within the Apache Tika community to
>> utilize Joshua within Tika’s Translate API, see
>> [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
>> 
>> == Known Risks ==
>> 
>> === Orphaned products ===
>> At the moment, regular contributions are made by a single contributor, the
>> lead maintainer. He (Matt Post) plans to continue development for the next
>> few years, but it is still a single point of failure, since the graduate
>> students who worked on the project have moved on to jobs, mostly in
>> industry. However, our goal is to help that process by growing the
>> community in Apache, and at least in growing the community with users and
>> participants from NASA JPL.
>> 
>> === Inexperience with Open Source ===
>> The team both at Johns Hopkins and NASA JPL have experience with many OSS
>> software projects at Apache and elsewhere. We understand "how it works"
>> here at the foundation.
>> 
>> 
>> == Relationships with Other Apache Products ==
>> Joshua includes dependences on Hadoop, and also is included as a plugin in
>> Apache Tika. We are also interested in coordinating with other projects
>> including Spark, and other projects needing MT services for language
>> translation.
>> 
>> == Developers ==
>> Joshua only has one regular developer who is employed by Johns Hopkins
>> University. NASA JPL (Mattmann and McGibbney) have been contributing
>> lately including a Brew formula and other contributions to the project
>> through the DARPA XDATA and Memex programs.
>> 
>> == Documentation ==
>> Documentation and publications related to Joshua can be found at
>> joshua-decoder.org. The source for the Joshua documentation is currently
>> hosted on Github at
>> https://github.com/joshua-decoder/joshua-decoder.github.com
>> 
>> == Initial Source ==
>> Current source resides at Github: github.com/joshua-decoder/joshua (the
>> main decoder and toolkit) and github.com/joshua-decoder/thrax (the grammar
>> extraction tool).
>> 
>> == External Dependencies ==
>> Joshua has a number of external dependencies. Only BerkeleyLM (Apache 2.0)
>> and KenLM (LGPG 2.1) are run-time decoder dependencies (one of which is
>> needed for translating sentences with pre-built models). The rest are
>> dependencies for the build system and pipeline, used for constructing and
>> training new models from parallel text.
>> 
>> Apache projects:
>> * Ant
>> * Hadoop
>> * Commons
>> * Maven
>> * Ivy
>> 
>> There are also a number of other open-source projects with various
>> licenses that the project depends on both dynamically (runtime), and
>> statically.
>> 
>> === GNU GPL 2 ===
>> * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
>> 
>> === LGPG 2.1 ===
>> * KenLM: github.com/kpu/kenlm
>> 
>> === Apache 2.0 ===
>> * BerkeleyLM: https://code.google.com/p/berkeleylm/
>> 
>> === GNU GPL ===
>> * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
>> 
>> == Required Resources ==
>> * Mailing Lists
>>  * private@joshua.incubator.apache.org
>>  * dev@joshua.incubator.apache.org
>>  * commits@joshua.incubator.apache.org
>> 
>> * Git Repos
>>  * https://git-wip-us.apache.org/repos/asf/joshua.git
>> 
>> * Issue Tracking
>>  * JIRA Joshua (JOSHUA)
>> 
>> * Continuous Integration
>>  * Jenkins builds on https://builds.apache.org/
>> 
>> * Web
>>  * http://joshua.incubator.apache.org/
>>  * wiki at http://cwiki.apache.org
>> 
>> == Initial Committers ==
>> The following is a list of the planned initial Apache committers (the
>> active subset of the committers for the current repository on Github).
>> 
>> * Matt Post (post@cs.jhu.edu)
>> * Lewis John McGibbney (lewismc@apache.org)
>> * Chris Mattmann (mattmann@apache.org)
>> 
>> == Affiliations ==
>> 
>> * Johns Hopkins University
>>  * Matt Post
>> 
>> * NASA JPL 
>>  * Chris Mattmann
>>  * Lewis John McGibbney
>> 
>> 
>> == Sponsors ==
>> === Champion ===
>> * Chris Mattmann (NASA/JPL)
>> 
>> === Nominated Mentors ===
>> * Paul Ramirez
>> * Lewis John McGibbney
>> * Chris Mattmann
>> 
>> == Sponsoring Entity ==
>> The Apache Incubator
>> 
>> 
>> 
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattmann@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> ТÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÒÐÐ¥FòVç7V'67&–&RÂRÖÖ–âvVæW&Â×Vç7V'67&–&T–æ7V&F÷"æ6†Ræ÷&pФf÷"FF—F–öæÂ6öÖÖæG2ÂRÖÖ–âvVæW&ÂÖ†VÇ–æ7V&F÷"æ6†Ræ÷&pÐ

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Mime
View raw message