Subject: Re: [VOTE] Accept Joshua as an Apache Incubator Podling
From: Jim Jagielski
Date: Mon, 1 Feb 2016 07:20:57 -0500
To: general@incubator.apache.org
Cc: "post@cs.jhu.edu"
Message-Id: <5080E4D1-08B3-430B-8ED9-40355CDDD54F@jaguNET.com>

I know this is specifically called-out in the proposal, but it
does seem worthy of further discussion.

This has a pretty small list of initial committers, esp when one
considers how over-booked 2 of them appear to be. So, realistically,
how active do both Chris and Lewis expect to be?

> On Jan 30, 2016, at 3:00 PM, Mattmann, Chris A (3980) wrote:
>
> Hi Everyone,
>
> OK the discussion is now completed. Please VOTE to accept Joshua
> into the Apache Incubator. I’ll leave the VOTE open for at least
> the next 72 hours, with hopes to close it next Friday the 5th of
> February, 2016.
>
> [ ] +1 Accept Joshua as an Apache Incubator podling.
> [ ] +0 Abstain.
> [ ] -1 Don’t accept Joshua as an Apache Incubator podling because..
>
> Of course, I am +1 on this. Please note VOTEs from Incubator PMC
> members are binding but all are welcome to VOTE!
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: jpluser
> Date: Tuesday, January 12, 2016 at 10:56 PM
> To: "general@incubator.apache.org"
> Cc: "post@cs.jhu.edu"
> Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
>
>> Hi Everyone,
>>
>> Please find attached for your viewing pleasure a proposed new project,
>> Apache Joshua, a statistical machine translation toolkit. The proposal
>> is in wiki draft form at: https://wiki.apache.org/incubator/JoshuaProposal
>>
>> Proposal text is copied below. I’ll leave the discussion open for a week
>> and we are interested in folks who would like to be initial committers
>> and mentors. Please discuss here on the thread.
>>
>> Thanks!
>>
>> Cheers,
>> Chris (Champion)
>>
>> ---
>>
>> = Joshua Proposal =
>>
>> == Abstract ==
>> [[joshua-decoder.org|Joshua]] is an open-source statistical machine
>> translation toolkit.
>> It includes a Java-based decoder for translating with
>> phrase-based, hierarchical, and syntax-based translation models, a
>> Hadoop-based grammar extractor (Thrax), and an extensive set of tools and
>> scripts for training and evaluating new models from parallel text.
>>
>> == Proposal ==
>> Joshua is a state-of-the-art statistical machine translation system that
>> provides a number of features:
>>
>> * Support for the two main paradigms in statistical machine translation:
>> phrase-based and hierarchical / syntactic.
>> * A sparse feature API that makes it easy to add new feature templates
>> supporting millions of features
>> * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
>> * Support for lattice decoding, allowing upstream NLP tools to expose
>> their hypothesis space to the MT system
>> * An efficient representation for models, allowing for quick loading of
>> multi-gigabyte model files
>> * Fast decoding speed (on par with Moses and mtplz)
>> * Language packs: precompiled models that allow the decoder to be run as
>> a black box
>> * Thrax, a Hadoop-based tool for learning translation models from
>> parallel text
>> * A suite of tools for constructing new models for any language pair for
>> which sufficient training data exists
>>
>> == Background and Rationale ==
>> A number of factors make this a good time for an Apache project focused on
>> machine translation (MT): the quality of MT output (for many language
>> pairs); the average computing resources available on computers, relative
>> to the needs of MT systems; and the availability of a number of
>> high-quality toolkits, together with a large base of researchers working
>> on them.
>>
>> Over the past decade, machine translation (MT; the automatic translation
>> of one human language to another) has become a reality.
>> The research into statistical approaches to translation that began in
>> the early nineties,
>> together with the availability of large amounts of training data, and
>> better computing infrastructure, have all come together to produce
>> translation results that are “good enough” for a large set of language
>> pairs and use cases. Free services like
>> [[https://www.bing.com/translator|Bing Translator]] and
>> [[https://translate.google.com|Google Translate]] have made these services
>> available to the average person through direct interfaces and through
>> tools like browser plugins, and sites across the world with higher
>> translation needs use them to translate their pages automatically.
>>
>> MT does not require the infrastructure of large corporations in order to
>> produce feasible output. Machine translation can be resource-intensive,
>> but need not be prohibitively so. Disk and memory usage are mostly a
>> matter of model size, which for most language pairs is a few gigabytes at
>> most, at which size models can provide coverage on the order of tens or
>> even hundreds of thousands of words in the input and output languages. The
>> computational complexity of the algorithms used to search for translations
>> of new sentences is typically linear in the number of words in the input
>> sentence, making it possible to run a translation engine on a personal
>> computer.
>>
>> The research community has produced many different open source translation
>> projects for a range of programming languages and under a variety of
>> licenses. These projects include the core “decoder”, which takes a model
>> and uses it to translate new sentences between the language pair the model
>> was defined for. They also typically include a large set of tools that
>> enable new models to be built from large sets of example translations
>> (“parallel data”) and monolingual texts.
>> These toolkits are usually built
>> to support the agendas of the (largely) academic researchers that build
>> them: the repeated cycle of building new models, tuning model parameters
>> against development data, and evaluating them against held-out test data,
>> using standard metrics for testing the quality of MT output.
>>
>> Together, these three factors (the quality of machine translation output,
>> the feasibility of translating on standard computers, and the availability
>> of tools to build models) make it reasonable for end users to use MT as
>> a black-box service, and to run it on their personal machines.
>>
>> These factors make it a good time for an organization with the status of
>> the Apache Software Foundation to host a machine translation project.
>>
>> == Current Status ==
>> Joshua was originally ported from David Chiang’s Python implementation of
>> Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
>> University. The current version is maintained by Matt Post at Johns
>> Hopkins’ Human Language Technology Center of Excellence. Joshua has made
>> many releases with a list of over 20 source code tags. The last release of
>> Joshua was 6.0.5 on November 5th, 2015.
>>
>> == Meritocracy ==
>> The current developers are familiar with meritocratic open source
>> development at Apache. Apache was chosen specifically because we want to
>> encourage this style of development for the project.
>>
>> == Community ==
>> Joshua is used widely across the world. Perhaps its biggest (known)
>> research / industrial user is the Amazon research group in Berlin. Another
>> user is the US Army Research Lab. No formal census has been undertaken,
>> but posts to the Joshua technical support mailing list, along with the
>> occasional contributions, suggest small research and academic communities
>> spread across the world, many of them in India.
>>
>> During incubation, we will explicitly seek to increase our usage across
>> the board, including academic research, industry, and other end users
>> interested in statistical machine translation.
>>
>> == Core Developers ==
>> The current set of core developers is fairly small, having fallen with the
>> graduation from Johns Hopkins of some core student participants. However,
>> Joshua is used fairly widely, as mentioned above, and there remains a
>> commitment from the principal researcher at Johns Hopkins to continue to
>> use and develop it. Joshua has seen a number of new community members
>> become interested recently due to a potential for its projected use in a
>> number of ongoing DARPA projects such as XDATA and Memex.
>>
>> == Alignment ==
>> Joshua is currently Copyright (c) 2015, Johns Hopkins University All
>> rights reserved and licensed under BSD 2-clause license. It would of
>> course be the intention to relicense this code under AL2.0 which would
>> permit expanded and increased use of the software within Apache projects.
>> There is currently an ongoing effort within the Apache Tika community to
>> utilize Joshua within Tika’s Translate API, see
>> [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
>>
>> == Known Risks ==
>>
>> === Orphaned products ===
>> At the moment, regular contributions are made by a single contributor, the
>> lead maintainer. He (Matt Post) plans to continue development for the next
>> few years, but it is still a single point of failure, since the graduate
>> students who worked on the project have moved on to jobs, mostly in
>> industry. However, our goal is to help that process by growing the
>> community in Apache, and at least in growing the community with users and
>> participants from NASA JPL.
>>
>> === Inexperience with Open Source ===
>> The teams at both Johns Hopkins and NASA JPL have experience with many OSS
>> projects at Apache and elsewhere. We understand "how it works"
>> here at the foundation.
>>
>> == Relationships with Other Apache Products ==
>> Joshua includes dependencies on Hadoop, and also is included as a plugin in
>> Apache Tika. We are also interested in coordinating with other projects
>> including Spark, and other projects needing MT services for language
>> translation.
>>
>> == Developers ==
>> Joshua only has one regular developer who is employed by Johns Hopkins
>> University. NASA JPL (Mattmann and McGibbney) have been contributing
>> lately, including a Brew formula and other contributions to the project
>> through the DARPA XDATA and Memex programs.
>>
>> == Documentation ==
>> Documentation and publications related to Joshua can be found at
>> joshua-decoder.org. The source for the Joshua documentation is currently
>> hosted on GitHub at
>> https://github.com/joshua-decoder/joshua-decoder.github.com
>>
>> == Initial Source ==
>> Current source resides at GitHub: github.com/joshua-decoder/joshua (the
>> main decoder and toolkit) and github.com/joshua-decoder/thrax (the grammar
>> extraction tool).
>>
>> == External Dependencies ==
>> Joshua has a number of external dependencies. Only BerkeleyLM (Apache 2.0)
>> and KenLM (LGPL 2.1) are run-time decoder dependencies (one of which is
>> needed for translating sentences with pre-built models). The rest are
>> dependencies for the build system and pipeline, used for constructing and
>> training new models from parallel text.
>>
>> Apache projects:
>> * Ant
>> * Hadoop
>> * Commons
>> * Maven
>> * Ivy
>>
>> There are also a number of other open-source projects with various
>> licenses that the project depends on both dynamically (runtime), and
>> statically.
>>
>> === GNU GPL 2 ===
>> * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
>>
>> === LGPL 2.1 ===
>> * KenLM: github.com/kpu/kenlm
>>
>> === Apache 2.0 ===
>> * BerkeleyLM: https://code.google.com/p/berkeleylm/
>>
>> === GNU GPL ===
>> * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
>>
>> == Required Resources ==
>> * Mailing Lists
>>   * private@joshua.incubator.apache.org
>>   * dev@joshua.incubator.apache.org
>>   * commits@joshua.incubator.apache.org
>>
>> * Git Repos
>>   * https://git-wip-us.apache.org/repos/asf/joshua.git
>>
>> * Issue Tracking
>>   * JIRA Joshua (JOSHUA)
>>
>> * Continuous Integration
>>   * Jenkins builds on https://builds.apache.org/
>>
>> * Web
>>   * http://joshua.incubator.apache.org/
>>   * wiki at http://cwiki.apache.org
>>
>> == Initial Committers ==
>> The following is a list of the planned initial Apache committers (the
>> active subset of the committers for the current repository on GitHub).
>>
>> * Matt Post (post@cs.jhu.edu)
>> * Lewis John McGibbney (lewismc@apache.org)
>> * Chris Mattmann (mattmann@apache.org)
>>
>> == Affiliations ==
>>
>> * Johns Hopkins University
>>   * Matt Post
>>
>> * NASA JPL
>>   * Chris Mattmann
>>   * Lewis John McGibbney
>>
>> == Sponsors ==
>> === Champion ===
>> * Chris Mattmann (NASA/JPL)
>>
>> === Nominated Mentors ===
>> * Paul Ramirez
>> * Lewis John McGibbney
>> * Chris Mattmann
>>
>> == Sponsoring Entity ==
>> The Apache Incubator
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattmann@nasa.gov
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org