Return-Path:
X-Original-To: apmail-incubator-general-archive@www.apache.org
Delivered-To: apmail-incubator-general-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 9091E18AEF for ; Tue, 19 Jan 2016 03:57:07 +0000 (UTC)
Received: (qmail 53817 invoked by uid 500); 19 Jan 2016 03:57:06 -0000
Delivered-To: apmail-incubator-general-archive@incubator.apache.org
Received: (qmail 53610 invoked by uid 500); 19 Jan 2016 03:57:06 -0000
Mailing-List: contact general-help@incubator.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: general@incubator.apache.org
Delivered-To: mailing list general@incubator.apache.org
Received: (qmail 53598 invoked by uid 99); 19 Jan 2016 03:57:06 -0000
Received: from mail-relay.apache.org (HELO mail-relay.apache.org) (140.211.11.15) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 19 Jan 2016 03:57:06 +0000
Received: from mail-ig0-f174.google.com (mail-ig0-f174.google.com [209.85.213.174]) by mail-relay.apache.org (ASF Mail Server at mail-relay.apache.org) with ESMTPSA id 45DB81A0280 for ; Tue, 19 Jan 2016 03:57:06 +0000 (UTC)
Received: by mail-ig0-f174.google.com with SMTP id mw1so60230379igb.1 for ; Mon, 18 Jan 2016 19:57:06 -0800 (PST)
X-Gm-Message-State: AG10YOSgLeeimXB3LXSQY/h9rtuBCeMtQalwK7+LGBquByeK9ZPzw56Z3bI5ih6ZErjHrLqCnYPTPDb8cyl6mw==
MIME-Version: 1.0
X-Received: by 10.50.142.7 with SMTP id rs7mr14309589igb.90.1453175825556; Mon, 18 Jan 2016 19:57:05 -0800 (PST)
Received: by 10.107.32.19 with HTTP; Mon, 18 Jan 2016 19:57:05 -0800 (PST)
Date: Mon, 18 Jan 2016 19:57:05 -0800
X-Gmail-Original-Message-ID:
Message-ID:
Subject: Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
From: Henri Yandell
To: "Mattmann, Chris A (3980)" , general@incubator.apache.org
Content-Type: multipart/alternative; boundary=001a11c2eb5ae0eefd0529a7dcde

--001a11c2eb5ae0eefd0529a7dcde
Content-Type:
text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Non-binding +1 to Joshua joining the Incubator.

I'd be interested in mentoring.

> -----Original Message-----
> From: jpluser
> Reply-To: "general@incubator.apache.org"
> Date: Tuesday, January 12, 2016 at 10:56 PM
> To: "general@incubator.apache.org"
> Cc: "post@cs.jhu.edu"
> Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
>
> >Hi Everyone,
> >
> >Please find attached for your viewing pleasure a proposed new project,
> >Apache Joshua, a statistical machine translation toolkit. The proposal
> >is in wiki draft form at: https://wiki.apache.org/incubator/JoshuaProposal
> >
> >Proposal text is copied below. I’ll leave the discussion open for a week
> >and we are interested in folks who would like to be initial committers
> >and mentors. Please discuss here on the thread.
> >
> >Thanks!
> >
> >Cheers,
> >Chris (Champion)
> >
> >---
> >
> >= Joshua Proposal =
> >
> >== Abstract ==
> >[[joshua-decoder.org|Joshua]] is an open-source statistical machine
> >translation toolkit. It includes a Java-based decoder for translating with
> >phrase-based, hierarchical, and syntax-based translation models, a
> >Hadoop-based grammar extractor (Thrax), and an extensive set of tools and
> >scripts for training and evaluating new models from parallel text.
> >
> >== Proposal ==
> >Joshua is a state of the art statistical machine translation system that
> >provides a number of features:
> >
> > * Support for the two main paradigms in statistical machine translation:
> >phrase-based and hierarchical / syntactic.
> > * A sparse feature API that makes it easy to add new feature templates
> >supporting millions of features
> > * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
> > * Support for lattice decoding, allowing upstream NLP tools to expose
> >their hypothesis space to the MT system
> > * An efficient representation for models, allowing for quick loading of
> >multi-gigabyte model files
> > * Fast decoding speed (on par with Moses and mtplz)
> > * Language packs: precompiled models that allow the decoder to be run as
> >a black box
> > * Thrax, a Hadoop-based tool for learning translation models from
> >parallel text
> > * A suite of tools for constructing new models for any language pair for
> >which sufficient training data exists
> >
> >== Background and Rationale ==
> >A number of factors make this a good time for an Apache project focused on
> >machine translation (MT): the quality of MT output (for many language
> >pairs); the average computing resources available on computers, relative
> >to the needs of MT systems; and the availability of a number of
> >high-quality toolkits, together with a large base of researchers working
> >on them.
> >
> >Over the past decade, machine translation (MT; the automatic translation
> >of one human language to another) has become a reality. The research into
> >statistical approaches to translation that began in the early nineties,
> >together with the availability of large amounts of training data, and
> >better computing infrastructure, have all come together to produce
> >translation results that are “good enough” for a large set of language
> >pairs and use cases.
Free services like
> >[[https://www.bing.com/translator|Bing Translator]] and
> >[[https://translate.google.com|Google Translate]] have made these services
> >available to the average person through direct interfaces and through
> >tools like browser plugins, and sites across the world with higher
> >translation needs use them to translate their pages automatically.
> >
> >MT does not require the infrastructure of large corporations in order to
> >produce feasible output. Machine translation can be resource-intensive,
> >but need not be prohibitively so. Disk and memory usage are mostly a
> >matter of model size, which for most language pairs is a few gigabytes at
> >most, at which size models can provide coverage on the order of tens or
> >even hundreds of thousands of words in the input and output languages. The
> >computational complexity of the algorithms used to search for translations
> >of new sentences is typically linear in the number of words in the input
> >sentence, making it possible to run a translation engine on a personal
> >computer.
> >
> >The research community has produced many different open source translation
> >projects for a range of programming languages and under a variety of
> >licenses. These projects include the core “decoder”, which takes a model
> >and uses it to translate new sentences between the language pair the model
> >was defined for. They also typically include a large set of tools that
> >enable new models to be built from large sets of example translations
> >(“parallel data”) and monolingual texts. These toolkits are usually built
> >to support the agendas of the (largely) academic researchers that build
> >them: the repeated cycle of building new models, tuning model parameters
> >against development data, and evaluating them against held-out test data,
> >using standard metrics for testing the quality of MT output.
> >
> >Together, these three factors (the quality of machine translation output,
> >the feasibility of translating on standard computers, and the availability
> >of tools to build models) make it reasonable for the end users to use MT as
> >a black-box service, and to run it on their personal machine.
> >
> >These factors make it a good time for an organization with the status of
> >the Apache Foundation to host a machine translation project.
> >
> >== Current Status ==
> >Joshua was originally ported from David Chiang’s Python implementation of
> >Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
> >University. The current version is maintained by Matt Post at Johns
> >Hopkins’ Human Language Technology Center of Excellence. Joshua has made
> >many releases with a list of over 20 source code tags. The last release of
> >Joshua was 6.0.5 on November 5th, 2015.
> >
> >== Meritocracy ==
> >The current developers are familiar with meritocratic open source
> >development at Apache. Apache was chosen specifically because we want to
> >encourage this style of development for the project.
> >
> >== Community ==
> >Joshua is used widely across the world. Perhaps its biggest (known)
> >research / industrial user is the Amazon research group in Berlin. Another
> >user is the US Army Research Lab. No formal census has been undertaken,
> >but posts to the Joshua technical support mailing list, along with the
> >occasional contributions, suggest small research and academic communities
> >spread across the world, many of them in India.
> >
> >During incubation, we will explicitly seek to increase our usage across
> >the board, including academic research, industry, and other end users
> >interested in statistical machine translation.
> >
> >== Core Developers ==
> >The current set of core developers is fairly small, having fallen with the
> >graduation from Johns Hopkins of some core student participants. However,
> >Joshua is used fairly widely, as mentioned above, and there remains a
> >commitment from the principal researcher at Johns Hopkins to continue to
> >use and develop it. Joshua has seen a number of new community members
> >become interested recently due to a potential for its projected use in a
> >number of ongoing DARPA projects such as XDATA and Memex.
> >
> >== Alignment ==
> >Joshua is currently Copyright (c) 2015, Johns Hopkins University All
> >rights reserved and licensed under BSD 2-clause license. It would of
> >course be the intention to relicense this code under AL2.0 which would
> >permit expanded and increased use of the software within Apache projects.
> >There is currently an ongoing effort within the Apache Tika community to
> >utilize Joshua within Tika’s Translate API, see
> >[[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
> >
> >== Known Risks ==
> >
> >=== Orphaned products ===
> >At the moment, regular contributions are made by a single contributor, the
> >lead maintainer. He (Matt Post) plans to continue development for the next
> >few years, but it is still a single point of failure, since the graduate
> >students who worked on the project have moved on to jobs, mostly in
> >industry. However, our goal is to help that process by growing the
> >community in Apache, and at least in growing the community with users and
> >participants from NASA JPL.
> >
> >=== Inexperience with Open Source ===
> >The team both at Johns Hopkins and NASA JPL have experience with many OSS
> >software projects at Apache and elsewhere. We understand "how it works"
> >here at the foundation.
> >
> >
> >== Relationships with Other Apache Products ==
> >Joshua includes dependencies on Hadoop, and also is included as a plugin in
> >Apache Tika. We are also interested in coordinating with other projects
> >including Spark, and other projects needing MT services for language
> >translation.
> >
> >== Developers ==
> >Joshua only has one regular developer who is employed by Johns Hopkins
> >University. NASA JPL (Mattmann and McGibbney) have been contributing
> >lately including a Brew formula and other contributions to the project
> >through the DARPA XDATA and Memex programs.
> >
> >== Documentation ==
> >Documentation and publications related to Joshua can be found at
> >joshua-decoder.org. The source for the Joshua documentation is currently
> >hosted on Github at
> >https://github.com/joshua-decoder/joshua-decoder.github.com
> >
> >== Initial Source ==
> >Current source resides at Github: github.com/joshua-decoder/joshua (the
> >main decoder and toolkit) and github.com/joshua-decoder/thrax (the grammar
> >extraction tool).
> >
> >== External Dependencies ==
> >Joshua has a number of external dependencies. Only BerkeleyLM (Apache 2.0)
> >and KenLM (LGPL 2.1) are run-time decoder dependencies (one of which is
> >needed for translating sentences with pre-built models). The rest are
> >dependencies for the build system and pipeline, used for constructing and
> >training new models from parallel text.
> >
> >Apache projects:
> > * Ant
> > * Hadoop
> > * Commons
> > * Maven
> > * Ivy
> >
> >There are also a number of other open-source projects with various
> >licenses that the project depends on both dynamically (runtime), and
> >statically.
> >
> >=== GNU GPL 2 ===
> > * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
> >
> >=== LGPL 2.1 ===
> > * KenLM: github.com/kpu/kenlm
> >
> >=== Apache 2.0 ===
> > * BerkeleyLM: https://code.google.com/p/berkeleylm/
> >
> >=== GNU GPL ===
> > * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
> >
> >== Required Resources ==
> > * Mailing Lists
> >   * private@joshua.incubator.apache.org
> >   * dev@joshua.incubator.apache.org
> >   * commits@joshua.incubator.apache.org
> >
> > * Git Repos
> >   * https://git-wip-us.apache.org/repos/asf/joshua.git
> >
> > * Issue Tracking
> >   * JIRA Joshua (JOSHUA)
> >
> > * Continuous Integration
> >   * Jenkins builds on https://builds.apache.org/
> >
> > * Web
> >   * http://joshua.incubator.apache.org/
> >   * wiki at http://cwiki.apache.org
> >
> >== Initial Committers ==
> >The following is a list of the planned initial Apache committers (the
> >active subset of the committers for the current repository on Github).
> >
> > * Matt Post (post@cs.jhu.edu)
> > * Lewis John McGibbney (lewismc@apache.org)
> > * Chris Mattmann (mattmann@apache.org)
> >
> >== Affiliations ==
> >
> > * Johns Hopkins University
> >   * Matt Post
> >
> > * NASA JPL
> >   * Chris Mattmann
> >   * Lewis John McGibbney
> >
> >
> >== Sponsors ==
> >=== Champion ===
> > * Chris Mattmann (NASA/JPL)
> >
> >=== Nominated Mentors ===
> > * Paul Ramirez
> > * Lewis John McGibbney
> > * Chris Mattmann
> >
> >== Sponsoring Entity ==
> >The Apache Incubator
> >
> >
> >
> >
> >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >Chris Mattmann, Ph.D.
> >Chief Architect
> >Instrument Software and Science Data Systems Section (398)
> >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >Office: 168-519, Mailstop: 168-527
> >Email: chris.a.mattmann@nasa.gov
> >WWW: http://sunset.usc.edu/~mattmann/
> >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >Adjunct Associate Professor, Computer Science Department
> >University of Southern California, Los Angeles, CA 90089 USA
> >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--001a11c2eb5ae0eefd0529a7dcde--