incubator-general mailing list archives

From "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
Date Tue, 19 Jan 2016 14:51:57 GMT
Thanks JB, no problem. You are welcome to join. Again, I will call
a VOTE in a few days, so please add yourself before then. Cheers.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Jean-Baptiste Onofré <jb@nanthrax.net>
Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
Date: Tuesday, January 19, 2016 at 1:46 AM
To: "general@incubator.apache.org" <general@incubator.apache.org>
Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine
Translation Toolkit

>I would be honoured. However, as I'm champion on other coming proposals,
>and to keep a good help level, I prefer to wait a couple of days to see
>if others jump in. If you need an additional mentor, please let me know.
>
>Thanks Chris !
>Regards
>JB
>
>On 01/19/2016 08:11 AM, Mattmann, Chris A (3980) wrote:
>> Thanks JB - if you are interested in mentoring would appreciate
>> the help.
>>
>> -----Original Message-----
>> From: Jean-Baptiste Onofré <jb@nanthrax.net>
>> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
>> Date: Monday, January 18, 2016 at 11:01 PM
>> To: "general@incubator.apache.org" <general@incubator.apache.org>
>> Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine
>> Translation Toolkit
>>
>>> Hi Chris,
>>>
>>> It looks interesting. I'm looking forward to the vote.
>>>
>>> Regards
>>> JB
>>>
>>> On 01/13/2016 07:56 AM, Mattmann, Chris A (3980) wrote:
>>>> Hi Everyone,
>>>>
>>>> Please find attached for your viewing pleasure a proposed new project,
>>>> Apache Joshua, a statistical machine translation toolkit. The proposal
>>>> is in wiki draft form at:
>>>> https://wiki.apache.org/incubator/JoshuaProposal
>>>>
>>>> Proposal text is copied below. I’ll leave the discussion open for a
>>>> week, and we are interested in folks who would like to be initial
>>>> committers and mentors. Please discuss here on the thread.
>>>>
>>>> Thanks!
>>>>
>>>> Cheers,
>>>> Chris (Champion)
>>>>
>>>> ---
>>>>
>>>> = Joshua Proposal =
>>>>
>>>> == Abstract ==
>>>> [[joshua-decoder.org|Joshua]] is an open-source statistical machine
>>>> translation toolkit. It includes a Java-based decoder for translating
>>>> with phrase-based, hierarchical, and syntax-based translation models,
>>>> a Hadoop-based grammar extractor (Thrax), and an extensive set of
>>>> tools and scripts for training and evaluating new models from
>>>> parallel text.
>>>>
>>>> == Proposal ==
>>>> Joshua is a state-of-the-art statistical machine translation system
>>>> that provides a number of features:
>>>>
>>>>    * Support for the two main paradigms in statistical machine
>>>>      translation: phrase-based and hierarchical / syntactic
>>>>    * A sparse feature API that makes it easy to add new feature
>>>>      templates supporting millions of features
>>>>    * Native implementations of many tuners (MERT, MIRA, PRO, and
>>>>      AdaGrad)
>>>>    * Support for lattice decoding, allowing upstream NLP tools to
>>>>      expose their hypothesis space to the MT system
>>>>    * An efficient representation for models, allowing for quick
>>>>      loading of multi-gigabyte model files
>>>>    * Fast decoding speed (on par with Moses and mtplz)
>>>>    * Language packs: precompiled models that allow the decoder to be
>>>>      run as a black box
>>>>    * Thrax, a Hadoop-based tool for learning translation models from
>>>>      parallel text
>>>>    * A suite of tools for constructing new models for any language
>>>>      pair for which sufficient training data exists
>>>>
>>>> == Background and Rationale ==
>>>> A number of factors make this a good time for an Apache project
>>>> focused on machine translation (MT): the quality of MT output (for
>>>> many language pairs); the average computing resources available on
>>>> computers, relative to the needs of MT systems; and the availability
>>>> of a number of high-quality toolkits, together with a large base of
>>>> researchers working on them.
>>>>
>>>> Over the past decade, machine translation (MT; the automatic
>>>> translation of one human language to another) has become a reality.
>>>> The research into statistical approaches to translation that began in
>>>> the early nineties, the availability of large amounts of training
>>>> data, and better computing infrastructure have all come together to
>>>> produce translation results that are “good enough” for a large set of
>>>> language pairs and use cases. Free services like
>>>> [[https://www.bing.com/translator|Bing Translator]] and
>>>> [[https://translate.google.com|Google Translate]] have made such
>>>> translations available to the average person through direct
>>>> interfaces and through tools like browser plugins, and sites across
>>>> the world with higher translation needs use them to translate their
>>>> pages automatically.
>>>>
>>>> MT does not require the infrastructure of large corporations in order
>>>> to produce usable output. Machine translation can be
>>>> resource-intensive, but need not be prohibitively so. Disk and memory
>>>> usage are mostly a matter of model size, which for most language
>>>> pairs is a few gigabytes at most, at which size models can provide
>>>> coverage on the order of tens or even hundreds of thousands of words
>>>> in the input and output languages. The computational complexity of
>>>> the algorithms used to search for translations of new sentences is
>>>> typically linear in the number of words in the input sentence, making
>>>> it possible to run a translation engine on a personal computer.
>>>>
>>>> The research community has produced many different open source
>>>> translation projects for a range of programming languages and under a
>>>> variety of licenses. These projects include the core “decoder”, which
>>>> takes a model and uses it to translate new sentences between the
>>>> language pair the model was defined for. They also typically include
>>>> a large set of tools that enable new models to be built from large
>>>> sets of example translations (“parallel data”) and monolingual texts.
>>>> These toolkits are usually built to support the agendas of the
>>>> (largely) academic researchers that build them: the repeated cycle of
>>>> building new models, tuning model parameters against development
>>>> data, and evaluating them against held-out test data, using standard
>>>> metrics for testing the quality of MT output.
>>>>
>>>> Together, these three factors (the quality of machine translation
>>>> output, the feasibility of translating on standard computers, and the
>>>> availability of tools to build models) make it reasonable for end
>>>> users to use MT as a black-box service, and to run it on their
>>>> personal machines.
>>>>
>>>> These factors make it a good time for an organization with the status
>>>> of the Apache Software Foundation to host a machine translation
>>>> project.
>>>>
>>>> == Current Status ==
>>>> Joshua was originally ported from David Chiang’s Python implementation
>>>> of Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
>>>> University. The current version is maintained by Matt Post at Johns
>>>> Hopkins’ Human Language Technology Center of Excellence. Joshua has
>>>> made many releases, with over 20 source code tags. The last release
>>>> of Joshua was 6.0.5 on November 5th, 2015.
>>>>
>>>> == Meritocracy ==
>>>> The current developers are familiar with meritocratic open source
>>>> development at Apache. Apache was chosen specifically because we want
>>>> to encourage this style of development for the project.
>>>>
>>>> == Community ==
>>>> Joshua is used widely across the world. Perhaps its biggest (known)
>>>> research / industrial user is the Amazon research group in Berlin.
>>>> Another user is the US Army Research Lab. No formal census has been
>>>> undertaken, but posts to the Joshua technical support mailing list,
>>>> along with the occasional contributions, suggest small research and
>>>> academic communities spread across the world, many of them in India.
>>>>
>>>> During incubation, we will explicitly seek to increase our usage
>>>> across the board, including academic research, industry, and other
>>>> end users interested in statistical machine translation.
>>>>
>>>> == Core Developers ==
>>>> The current set of core developers is fairly small, having fallen
>>>> with the graduation from Johns Hopkins of some core student
>>>> participants. However, Joshua is used fairly widely, as mentioned
>>>> above, and there remains a commitment from the principal researcher
>>>> at Johns Hopkins to continue to use and develop it. Joshua has seen a
>>>> number of new community members become interested recently due to
>>>> its potential use in a number of ongoing DARPA projects such as XDATA
>>>> and Memex.
>>>>
>>>> == Alignment ==
>>>> Joshua is currently Copyright (c) 2015, Johns Hopkins University, all
>>>> rights reserved, and licensed under the BSD 2-clause license. It
>>>> would of course be the intention to relicense this code under the
>>>> Apache License 2.0 (AL2.0), which would permit expanded and increased
>>>> use of the software within Apache projects. There is currently an
>>>> ongoing effort within the Apache Tika community to utilize Joshua
>>>> within Tika’s Translate API; see
>>>> [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
>>>>
>>>> == Known Risks ==
>>>>
>>>> === Orphaned products ===
>>>> At the moment, regular contributions are made by a single
>>>> contributor, the lead maintainer. He (Matt Post) plans to continue
>>>> development for the next few years, but it is still a single point of
>>>> failure, since the graduate students who worked on the project have
>>>> moved on to jobs, mostly in industry. However, our goal is to
>>>> mitigate that risk by growing the community at Apache, at the very
>>>> least with users and participants from NASA JPL.
>>>>
>>>> === Inexperience with Open Source ===
>>>> The teams at both Johns Hopkins and NASA JPL have experience with
>>>> many OSS projects at Apache and elsewhere. We understand "how it
>>>> works" here at the foundation.
>>>>
>>>>
>>>> == Relationships with Other Apache Products ==
>>>> Joshua includes dependencies on Hadoop, and is also included as a
>>>> plugin in Apache Tika. We are also interested in coordinating with
>>>> other projects including Spark, and other projects needing MT
>>>> services for language translation.
>>>>
>>>> == Developers ==
>>>> Joshua only has one regular developer, who is employed by Johns
>>>> Hopkins University. NASA JPL (Mattmann and McGibbney) have been
>>>> contributing lately, including a Brew formula and other contributions
>>>> to the project through the DARPA XDATA and Memex programs.
>>>>
>>>> == Documentation ==
>>>> Documentation and publications related to Joshua can be found at
>>>> joshua-decoder.org. The source for the Joshua documentation is
>>>> currently hosted on GitHub at
>>>> https://github.com/joshua-decoder/joshua-decoder.github.com
>>>>
>>>> == Initial Source ==
>>>> Current source resides at GitHub: github.com/joshua-decoder/joshua
>>>> (the main decoder and toolkit) and github.com/joshua-decoder/thrax
>>>> (the grammar extraction tool).
>>>>
>>>> == External Dependencies ==
>>>> Joshua has a number of external dependencies. Only BerkeleyLM (Apache
>>>> 2.0) and KenLM (LGPL 2.1) are run-time decoder dependencies (one of
>>>> which is needed for translating sentences with pre-built models). The
>>>> rest are dependencies for the build system and pipeline, used for
>>>> constructing and training new models from parallel text.
>>>>
>>>> Apache projects:
>>>>    * Ant
>>>>    * Hadoop
>>>>    * Commons
>>>>    * Maven
>>>>    * Ivy
>>>>
>>>> There are also a number of other open-source projects with various
>>>> licenses that the project depends on, both dynamically (at runtime)
>>>> and statically.
>>>>
>>>> === GNU GPL 2 ===
>>>>    * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
>>>>
>>>> === LGPL 2.1 ===
>>>>    * KenLM: github.com/kpu/kenlm
>>>>
>>>> === Apache 2.0 ===
>>>>    * BerkeleyLM: https://code.google.com/p/berkeleylm/
>>>>
>>>> === GNU GPL ===
>>>>    * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
>>>>
>>>> == Required Resources ==
>>>>    * Mailing Lists
>>>>      * private@joshua.incubator.apache.org
>>>>      * dev@joshua.incubator.apache.org
>>>>      * commits@joshua.incubator.apache.org
>>>>
>>>>    * Git Repos
>>>>      * https://git-wip-us.apache.org/repos/asf/joshua.git
>>>>
>>>>    * Issue Tracking
>>>>      * JIRA Joshua (JOSHUA)
>>>>
>>>>    * Continuous Integration
>>>>      * Jenkins builds on https://builds.apache.org/
>>>>
>>>>    * Web
>>>>      * http://joshua.incubator.apache.org/
>>>>      * wiki at http://cwiki.apache.org
>>>>
>>>> == Initial Committers ==
>>>> The following is a list of the planned initial Apache committers (the
>>>> active subset of the committers for the current repository on Github).
>>>>
>>>>    * Matt Post (post@cs.jhu.edu)
>>>>    * Lewis John McGibbney (lewismc@apache.org)
>>>>    * Chris Mattmann (mattmann@apache.org)
>>>>
>>>> == Affiliations ==
>>>>
>>>>    * Johns Hopkins University
>>>>      * Matt Post
>>>>
>>>>    * NASA JPL
>>>>      * Chris Mattmann
>>>>      * Lewis John McGibbney
>>>>
>>>>
>>>> == Sponsors ==
>>>> === Champion ===
>>>>    * Chris Mattmann (NASA/JPL)
>>>>
>>>> === Nominated Mentors ===
>>>>    * Paul Ramirez
>>>>    * Lewis John McGibbney
>>>>    * Chris Mattmann
>>>>
>>>> == Sponsoring Entity ==
>>>> The Apache Incubator
>>>>
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>>> For additional commands, e-mail: general-help@incubator.apache.org
>>>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbonofre@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>

