incubator-general mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jean-Baptiste Onofré ...@nanthrax.net>
Subject Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
Date Tue, 19 Jan 2016 09:46:11 GMT
I would be honoured. However, as I'm champion on other coming proposals, 
and to keep a good help level, I prefer to wait a couple of days to see 
if others jump in. If you need an additional mentor, please let me know.

Thanks Chris !
Regards
JB

On 01/19/2016 08:11 AM, Mattmann, Chris A (3980) wrote:
> Thanks JB - if you are interested in mentoring would appreciate
> the help.
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
> -----Original Message-----
> From: Jean-Baptiste Onofré <jb@nanthrax.net>
> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
> Date: Monday, January 18, 2016 at 11:01 PM
> To: "general@incubator.apache.org" <general@incubator.apache.org>
> Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine
> Translation Toolkit
>
>> Hi Chris,
>>
>> it looks interesting. I'm looking forward for the vote.
>>
>> Regards
>> JB
>>
>> On 01/13/2016 07:56 AM, Mattmann, Chris A (3980) wrote:
>>> Hi Everyone,
>>>
>>> Please find attached for your viewing pleasure a proposed new project,
>>> Apache Joshua, a statistical machine translation toolkit. The proposal
>>> is in wiki draft form at:
>>> https://wiki.apache.org/incubator/JoshuaProposal
>>>
>>> Proposal text is copied below. I’ll leave the discussion open for a week
>>> and we are interested in folks who would like to be initial committers
>>> and mentors. Please discuss here on the thread.
>>>
>>> Thanks!
>>>
>>> Cheers,
>>> Chris (Champion)
>>>
>>> ———
>>>
>>> = Joshua Proposal =
>>>
>>> == Abstract ==
>>> [[joshua-decoder.org|Joshua]] is an open-source statistical machine
>>> translation toolkit. It includes a Java-based decoder for translating
>>> with
>>> phrase-based, hierarchical, and syntax-based translation models, a
>>> Hadoop-based grammar extractor (Thrax), and an extensive set of tools
>>> and
>>> scripts for training and evaluating new models from parallel text.
>>>
>>> == Proposal ==
>>> Joshua is a state of the art statistical machine translation system that
>>> provides a number of features:
>>>
>>>    * Support for the two main paradigms in statistical machine
>>> translation:
>>> phrase-based and hierarchical / syntactic.
>>>    * A sparse feature API that makes it easy to add new feature templates
>>> supporting millions of features
>>>    * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
>>>    * Support for lattice decoding, allowing upstream NLP tools to expose
>>> their hypothesis space to the MT system
>>>    * An efficient representation for models, allowing for quick loading
>>> of
>>> multi-gigabyte model files
>>>    * Fast decoding speed (on par with Moses and mtplz)
>>>    * Language packs — precompiled models that allow the decoder to be
>>> run as
>>> a black box
>>>    * Thrax, a Hadoop-based tool for learning translation models from
>>> parallel text
>>>    * A suite of tools for constructing new models for any language pair
>>> for
>>> which sufficient training data exists
>>>
>>> == Background and Rationale ==
>>> A number of factors make this a good time for an Apache project focused
>>> on
>>> machine translation (MT): the quality of MT output (for many language
>>> pairs); the average computing resources available on computers, relative
>>> to the needs of MT systems; and the availability of a number of
>>> high-quality toolkits, together with a large base of researchers working
>>> on them.
>>>
>>> Over the past decade, machine translation (MT; the automatic translation
>>> of one human language to another) has become a reality. The research
>>> into
>>> statistical approaches to translation that began in the early nineties,
>>> together with the availability of large amounts of training data, and
>>> better computing infrastructure, have all come together to produce
>>> translations results that are “good enough” for a large set of language
>>> pairs and use cases. Free services like
>>> [[https://www.bing.com/translator|Bing Translator]] and
>>> [[https://translate.google.com|Google Translate]] have made these
>>> services
>>> available to the average person through direct interfaces and through
>>> tools like browser plugins, and sites across the world with higher
>>> translation needs use them to translate their pages through
>>> automatically.
>>>
>>> MT does not require the infrastructure of large corporations in order to
>>> produce feasible output. Machine translation can be resource-intensive,
>>> but need not be prohibitively so. Disk and memory usage are mostly a
>>> matter of model size, which for most language pairs is a few gigabytes
>>> at
>>> most, at which size models can provide coverage on the order of tens or
>>> even hundreds of thousands of words in the input and output languages.
>>> The
>>> computational complexity of the algorithms used to search for
>>> translations
>>> of new sentences are typically linear in the number of words in the
>>> input
>>> sentence, making it possible to run a translation engine on a personal
>>> computer.
>>>
>>> The research community has produced many different open source
>>> translation
>>> projects for a range of programming languages and under a variety of
>>> licenses. These projects include the core “decoder”, which takes a model
>>> and uses it to translate new sentences between the language pair the
>>> model
>>> was defined for. They also typically include a large set of tools that
>>> enable new models to be built from large sets of example translations
>>> (“parallel data”) and monolingual texts. These toolkits are usually
>>> built
>>> to support the agendas of the (largely) academic researchers that build
>>> them: the repeated cycle of building new models, tuning model parameters
>>> against development data, and evaluating them against held-out test
>>> data,
>>> using standard metrics for testing the quality of MT output.
>>>
>>> Together, these three factors—the quality of machine translation output,
>>> the feasibility of translating on standard computers, and the
>>> availability
>>> of tools to build models—make it reasonable for the end users to use MT
>>> as
>>> a black-box service, and to run it on their personal machine.
>>>
>>> These factors make it a good time for an organization with the status of
>>> the Apache Foundation to host a machine translation project.
>>>
>>> == Current Status ==
>>> Joshua was originally ported from David Chiang’s Python implementation
>>> of
>>> Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
>>> University. The current version is maintained by Matt Post at Johns
>>> Hopkins’ Human Language Technology Center of Excellence. Joshua has made
>>> many releases with a list of over 20 source code tags. The last release
>>> of
>>> Joshua was 6.0.5 on November 5th, 2015.
>>>
>>> == Meritocracy ==
>>> The current developers are familiar with meritocratic open source
>>> development at Apache. Apache was chosen specifically because we want to
>>> encourage this style of development for the project.
>>>
>>> == Community ==
>>> Joshua is used widely across the world. Perhaps its biggest (known)
>>> research / industrial user is the Amazon research group in Berlin.
>>> Another
>>> user is the US Army Research Lab. No formal census has been undertaken,
>>> but posts to the Joshua technical support mailing list, along with the
>>> occasional contributions, suggest small research and academic
>>> communities
>>> spread across the world, many of them in India.
>>>
>>> During incubation, we will explicitly seek to increase our usage across
>>> the board, including academic research, industry, and other end users
>>> interested in statistical machine translation.
>>>
>>> == Core Developers ==
>>> The current set of core developers is fairly small, having fallen with
>>> the
>>> graduation from Johns Hopkins of some core student participants.
>>> However,
>>> Joshua is used fairly widely, as mentioned above, and there remains a
>>> commitment from the principal researcher at Johns Hopkins to continue to
>>> use and develop it. Joshua has seen a number of new community members
>>> become interested recently due to a potential for its projected use in a
>>> number of ongoing DARPA projects such as XDATA and Memex.
>>>
>>> == Alignment ==
>>> Joshua is currently Copyright (c) 2015, Johns Hopkins University All
>>> rights reserved and licensed under BSD 2-clause license. It would of
>>> course be the intention to relicense this code under AL2.0 which would
>>> permit expanded and increased use of the software within Apache
>>> projects.
>>> There is currently an ongoing effort within the Apache Tika community to
>>> utilize Joshua within Tika’s Translate API, see
>>> [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
>>>
>>> == Known Risks ==
>>>
>>> === Orphaned products ===
>>> At the moment, regular contributions are made by a single contributor,
>>> the
>>> lead maintainer. He (Matt Post) plans to continue development for the
>>> next
>>> few years, but it is still a single point of failure, since the graduate
>>> students who worked on the project have moved on to jobs, mostly in
>>> industry. However, our goal is to help that process by growing the
>>> community in Apache, and at least in growing the community with users
>>> and
>>> participants from NASA JPL.
>>>
>>> === Inexperience with Open Source ===
>>> The team both at Johns Hopkins and NASA JPL have experience with many
>>> OSS
>>> software projects at Apache and elsewhere. We understand "how it works"
>>> here at the foundation.
>>>
>>>
>>> == Relationships with Other Apache Products ==
>>> Joshua includes dependences on Hadoop, and also is included as a plugin
>>> in
>>> Apache Tika. We are also interested in coordinating with other projects
>>> including Spark, and other projects needing MT services for language
>>> translation.
>>>
>>> == Developers ==
>>> Joshua only has one regular developer who is employed by Johns Hopkins
>>> University. NASA JPL (Mattmann and McGibbney) have been contributing
>>> lately including a Brew formula and other contributions to the project
>>> through the DARPA XDATA and Memex programs.
>>>
>>> == Documentation ==
>>> Documentation and publications related to Joshua can be found at
>>> joshua-decoder.org. The source for the Joshua documentation is currently
>>> hosted on Github at
>>> https://github.com/joshua-decoder/joshua-decoder.github.com
>>>
>>> == Initial Source ==
>>> Current source resides at Github: github.com/joshua-decoder/joshua (the
>>> main decoder and toolkit) and github.com/joshua-decoder/thrax (the
>>> grammar
>>> extraction tool).
>>>
>>> == External Dependencies ==
>>> Joshua has a number of external dependencies. Only BerkeleyLM (Apache
>>> 2.0)
>>> and KenLM (LGPG 2.1) are run-time decoder dependencies (one of which is
>>> needed for translating sentences with pre-built models). The rest are
>>> dependencies for the build system and pipeline, used for constructing
>>> and
>>> training new models from parallel text.
>>>
>>> Apache projects:
>>>    * Ant
>>>    * Hadoop
>>>    * Commons
>>>    * Maven
>>>    * Ivy
>>>
>>> There are also a number of other open-source projects with various
>>> licenses that the project depends on both dynamically (runtime), and
>>> statically.
>>>
>>> === GNU GPL 2 ===
>>>    * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
>>>
>>> === LGPG 2.1 ===
>>>    * KenLM: github.com/kpu/kenlm
>>>
>>> === Apache 2.0 ===
>>>    * BerkeleyLM: https://code.google.com/p/berkeleylm/
>>>
>>> === GNU GPL ===
>>>    * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
>>>
>>> == Required Resources ==
>>>    * Mailing Lists
>>>      * private@joshua.incubator.apache.org
>>>      * dev@joshua.incubator.apache.org
>>>      * commits@joshua.incubator.apache.org
>>>
>>>    * Git Repos
>>>      * https://git-wip-us.apache.org/repos/asf/joshua.git
>>>
>>>    * Issue Tracking
>>>      * JIRA Joshua (JOSHUA)
>>>
>>>    * Continuous Integration
>>>      * Jenkins builds on https://builds.apache.org/
>>>
>>>    * Web
>>>      * http://joshua.incubator.apache.org/
>>>      * wiki at http://cwiki.apache.org
>>>
>>> == Initial Committers ==
>>> The following is a list of the planned initial Apache committers (the
>>> active subset of the committers for the current repository on Github).
>>>
>>>    * Matt Post (post@cs.jhu.edu)
>>>    * Lewis John McGibbney (lewismc@apache.org)
>>>    * Chris Mattmann (mattmann@apache.org)
>>>
>>> == Affiliations ==
>>>
>>>    * Johns Hopkins University
>>>      * Matt Post
>>>
>>>    * NASA JPL
>>>      * Chris Mattmann
>>>      * Lewis John McGibbney
>>>
>>>
>>> == Sponsors ==
>>> === Champion ===
>>>    * Chris Mattmann (NASA/JPL)
>>>
>>> === Nominated Mentors ===
>>>    * Paul Ramirez
>>>    * Lewis John McGibbney
>>>    * Chris Mattmann
>>>
>>> == Sponsoring Entity ==
>>> The Apache Incubator
>>>
>>>
>>>
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattmann@nasa.gov
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>> For additional commands, e-mail: general-help@incubator.apache.org
>>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>> For additional commands, e-mail: general-help@incubator.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org


Mime
View raw message