incubator-general mailing list archives

From Jim Jagielski <...@jaguNET.com>
Subject Re: [VOTE] Accept Joshua as an Apache Incubator Podling
Date Mon, 01 Feb 2016 16:41:46 GMT
OK, cool... Just thought the topic warranted some level of
discussion ;)

> On Feb 1, 2016, at 10:31 AM, Tom Barber <tom.barber@meteorite.bi> wrote:
> 
> Hello! I'm a code-aholic, you'll be getting regular commits from me.
> 
> Regards,
> 
> Tom
> 
> On Mon, Feb 1, 2016 at 3:20 PM, Mattmann, Chris A (3980) <
> chris.a.mattmann@jpl.nasa.gov> wrote:
> 
>> Hey Jim,
>> 
>> This is a valid concern, one that I hope is mitigated by taking
>> however long it takes in Incubation to attract some new committers
>> to work on the project. Hopefully too you saw how long I took to
>> allow the discussion to occur and so forth.
>> 
>> Lewis has already actively contributed to Joshua, as you can see from
>> the Homebrew package he created:
>> 
>> https://github.com/Homebrew/homebrew/pull/45746
>> 
>> 
>> You can see too that it wasn’t something recent or super quick;
>> it’s something he had to work at.
>> 
>> As for me, my involvement is going to be limited, but I am
>> actively pursuing Tika’s integration with Joshua as part of
>> TIKA-1343: http://issues.apache.org/jira/browse/TIKA-1343.
>> 
>> Finally my suspicion is that Tom, Henry and Tommaso will
>> contribute a lot as well.
>> 
>> Thanks for listening.
>> 
>> Cheers,
>> Chris
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: chris.a.mattmann@nasa.gov
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Jim Jagielski <jim@jaguNET.com>
>> Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
>> Date: Monday, February 1, 2016 at 4:20 AM
>> To: "general@incubator.apache.org" <general@incubator.apache.org>
>> Cc: "post@cs.jhu.edu" <post@cs.jhu.edu>
>> Subject: Re: [VOTE] Accept Joshua as an Apache Incubator Podling
>> 
>>> I know this is specifically called-out in the proposal, but it
>>> does seem worthy of further discussion.
>>> 
>>> This has a pretty small list of initial committers, esp when one considers
>>> how over-booked 2 of them appear to be.
>>> 
>>> So, realistically, how active do both Chris and Lewis expect
>>> to be?
>>> 
>>>> On Jan 30, 2016, at 3:00 PM, Mattmann, Chris A (3980)
>>>> <chris.a.mattmann@jpl.nasa.gov> wrote:
>>>> 
>>>> Hi Everyone,
>>>> 
>>>> OK the discussion is now completed. Please VOTE to accept Joshua
>>>> into the Apache Incubator. I’ll leave the VOTE open for at least
>>>> the next 72 hours, with hopes to close it next Friday the 5th of
>>>> February, 2016.
>>>> 
>>>> [ ] +1 Accept Joshua as an Apache Incubator podling.
>>>> [ ] +0 Abstain.
>>>> [ ] -1 Don’t accept Joshua as an Apache Incubator podling because..
>>>> 
>>>> Of course, I am +1 on this. Please note VOTEs from Incubator PMC
>>>> members are binding but all are welcome to VOTE!
>>>> 
>>>> Cheers,
>>>> Chris
>>>> 
>>>> -----Original Message-----
>>>> From: jpluser <chris.a.mattmann@jpl.nasa.gov>
>>>> Date: Tuesday, January 12, 2016 at 10:56 PM
>>>> To: "general@incubator.apache.org" <general@incubator.apache.org>
>>>> Cc: "post@cs.jhu.edu" <post@cs.jhu.edu>
>>>> Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine
>>>> Translation Toolkit
>>>> 
>>>>> Hi Everyone,
>>>>> 
>>>>> Please find attached for your viewing pleasure a proposed new project,
>>>>> Apache Joshua, a statistical machine translation toolkit. The proposal
>>>>> is in wiki draft form at:
>>>>> https://wiki.apache.org/incubator/JoshuaProposal
>>>>> 
>>>>> Proposal text is copied below. I’ll leave the discussion open for a week
>>>>> and we are interested in folks who would like to be initial committers
>>>>> and mentors. Please discuss here on the thread.
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> Cheers,
>>>>> Chris (Champion)
>>>>> 
>>>>> ———
>>>>> 
>>>>> = Joshua Proposal =
>>>>> 
>>>>> == Abstract ==
>>>>> [[joshua-decoder.org|Joshua]] is an open-source statistical machine
>>>>> translation toolkit. It includes a Java-based decoder for translating with
>>>>> phrase-based, hierarchical, and syntax-based translation models, a
>>>>> Hadoop-based grammar extractor (Thrax), and an extensive set of tools and
>>>>> scripts for training and evaluating new models from parallel text.
>>>>> 
>>>>> == Proposal ==
>>>>> Joshua is a state of the art statistical machine translation system that
>>>>> provides a number of features:
>>>>> 
>>>>> * Support for the two main paradigms in statistical machine translation:
>>>>> phrase-based and hierarchical / syntactic.
>>>>> * A sparse feature API that makes it easy to add new feature templates
>>>>> supporting millions of features
>>>>> * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
>>>>> * Support for lattice decoding, allowing upstream NLP tools to expose
>>>>> their hypothesis space to the MT system
>>>>> * An efficient representation for models, allowing for quick loading of
>>>>> multi-gigabyte model files
>>>>> * Fast decoding speed (on par with Moses and mtplz)
>>>>> * Language packs: precompiled models that allow the decoder to be run as
>>>>> a black box
>>>>> * Thrax, a Hadoop-based tool for learning translation models from
>>>>> parallel text
>>>>> * A suite of tools for constructing new models for any language pair for
>>>>> which sufficient training data exists
>>>>> 
>>>>> == Background and Rationale ==
>>>>> A number of factors make this a good time for an Apache project focused on
>>>>> machine translation (MT): the quality of MT output (for many language
>>>>> pairs); the average computing resources available on computers, relative
>>>>> to the needs of MT systems; and the availability of a number of
>>>>> high-quality toolkits, together with a large base of researchers working
>>>>> on them.
>>>>> 
>>>>> Over the past decade, machine translation (MT; the automatic translation
>>>>> of one human language to another) has become a reality. The research into
>>>>> statistical approaches to translation that began in the early nineties,
>>>>> together with the availability of large amounts of training data, and
>>>>> better computing infrastructure, have all come together to produce
>>>>> translation results that are “good enough” for a large set of language
>>>>> pairs and use cases. Free services like
>>>>> [[https://www.bing.com/translator|Bing Translator]] and
>>>>> [[https://translate.google.com|Google Translate]] have made these services
>>>>> available to the average person through direct interfaces and through
>>>>> tools like browser plugins, and sites across the world with higher
>>>>> translation needs use them to translate their pages automatically.
>>>>> 
>>>>> MT does not require the infrastructure of large corporations in order to
>>>>> produce feasible output. Machine translation can be resource-intensive,
>>>>> but need not be prohibitively so. Disk and memory usage are mostly a
>>>>> matter of model size, which for most language pairs is a few gigabytes at
>>>>> most, at which size models can provide coverage on the order of tens or
>>>>> even hundreds of thousands of words in the input and output languages. The
>>>>> computational complexity of the algorithms used to search for translations
>>>>> of new sentences is typically linear in the number of words in the input
>>>>> sentence, making it possible to run a translation engine on a personal
>>>>> computer.
>>>>> 
>>>>> The research community has produced many different open source translation
>>>>> projects for a range of programming languages and under a variety of
>>>>> licenses. These projects include the core “decoder”, which takes a model
>>>>> and uses it to translate new sentences between the language pair the model
>>>>> was defined for. They also typically include a large set of tools that
>>>>> enable new models to be built from large sets of example translations
>>>>> (“parallel data”) and monolingual texts. These toolkits are usually built
>>>>> to support the agendas of the (largely) academic researchers that build
>>>>> them: the repeated cycle of building new models, tuning model parameters
>>>>> against development data, and evaluating them against held-out test data,
>>>>> using standard metrics for testing the quality of MT output.
>>>>> 
>>>>> Together, these three factors (the quality of machine translation output,
>>>>> the feasibility of translating on standard computers, and the availability
>>>>> of tools to build models) make it reasonable for the end users to use MT as
>>>>> a black-box service, and to run it on their personal machine.
>>>>> 
>>>>> These factors make it a good time for an organization with the status of
>>>>> the Apache Software Foundation to host a machine translation project.
>>>>> 
>>>>> == Current Status ==
>>>>> Joshua was originally ported from David Chiang’s Python implementation of
>>>>> Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
>>>>> University. The current version is maintained by Matt Post at Johns
>>>>> Hopkins’ Human Language Technology Center of Excellence. Joshua has made
>>>>> many releases with a list of over 20 source code tags. The last release of
>>>>> Joshua was 6.0.5 on November 5th, 2015.
>>>>> 
>>>>> == Meritocracy ==
>>>>> The current developers are familiar with meritocratic open source
>>>>> development at Apache. Apache was chosen specifically because we want to
>>>>> encourage this style of development for the project.
>>>>> 
>>>>> == Community ==
>>>>> Joshua is used widely across the world. Perhaps its biggest (known)
>>>>> research / industrial user is the Amazon research group in Berlin. Another
>>>>> user is the US Army Research Lab. No formal census has been undertaken,
>>>>> but posts to the Joshua technical support mailing list, along with the
>>>>> occasional contributions, suggest small research and academic communities
>>>>> spread across the world, many of them in India.
>>>>> 
>>>>> During incubation, we will explicitly seek to increase our usage across
>>>>> the board, including academic research, industry, and other end users
>>>>> interested in statistical machine translation.
>>>>> 
>>>>> == Core Developers ==
>>>>> The current set of core developers is fairly small, having fallen with the
>>>>> graduation from Johns Hopkins of some core student participants. However,
>>>>> Joshua is used fairly widely, as mentioned above, and there remains a
>>>>> commitment from the principal researcher at Johns Hopkins to continue to
>>>>> use and develop it. Joshua has seen a number of new community members
>>>>> become interested recently due to a potential for its projected use in a
>>>>> number of ongoing DARPA projects such as XDATA and Memex.
>>>>> 
>>>>> == Alignment ==
>>>>> Joshua is currently Copyright (c) 2015, Johns Hopkins University, all
>>>>> rights reserved, and licensed under the BSD 2-clause license. It would of
>>>>> course be the intention to relicense this code under AL2.0, which would
>>>>> permit expanded and increased use of the software within Apache projects.
>>>>> There is currently an ongoing effort within the Apache Tika community to
>>>>> utilize Joshua within Tika’s Translate API, see
>>>>> [[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
>>>>> 
>>>>> == Known Risks ==
>>>>> 
>>>>> === Orphaned products ===
>>>>> At the moment, regular contributions are made by a single contributor, the
>>>>> lead maintainer. He (Matt Post) plans to continue development for the next
>>>>> few years, but it is still a single point of failure, since the graduate
>>>>> students who worked on the project have moved on to jobs, mostly in
>>>>> industry. However, our goal is to help that process by growing the
>>>>> community in Apache, and at least in growing the community with users and
>>>>> participants from NASA JPL.
>>>>> 
>>>>> === Inexperience with Open Source ===
>>>>> The team both at Johns Hopkins and NASA JPL have experience with many OSS
>>>>> software projects at Apache and elsewhere. We understand "how it works"
>>>>> here at the foundation.
>>>>> 
>>>>> 
>>>>> == Relationships with Other Apache Products ==
>>>>> Joshua includes dependencies on Hadoop, and also is included as a plugin in
>>>>> Apache Tika. We are also interested in coordinating with other projects
>>>>> including Spark, and other projects needing MT services for language
>>>>> translation.
>>>>> 
>>>>> == Developers ==
>>>>> Joshua only has one regular developer who is employed by Johns Hopkins
>>>>> University. NASA JPL (Mattmann and McGibbney) have been contributing
>>>>> lately including a Brew formula and other contributions to the project
>>>>> through the DARPA XDATA and Memex programs.
>>>>> 
>>>>> == Documentation ==
>>>>> Documentation and publications related to Joshua can be found at
>>>>> joshua-decoder.org. The source for the Joshua documentation is currently
>>>>> hosted on Github at
>>>>> https://github.com/joshua-decoder/joshua-decoder.github.com
>>>>> 
>>>>> == Initial Source ==
>>>>> Current source resides at Github: github.com/joshua-decoder/joshua (the
>>>>> main decoder and toolkit) and github.com/joshua-decoder/thrax (the grammar
>>>>> extraction tool).
>>>>> 
>>>>> == External Dependencies ==
>>>>> Joshua has a number of external dependencies. Only BerkeleyLM (Apache 2.0)
>>>>> and KenLM (LGPL 2.1) are run-time decoder dependencies (one of which is
>>>>> needed for translating sentences with pre-built models). The rest are
>>>>> dependencies for the build system and pipeline, used for constructing and
>>>>> training new models from parallel text.
>>>>> 
>>>>> Apache projects:
>>>>> * Ant
>>>>> * Hadoop
>>>>> * Commons
>>>>> * Maven
>>>>> * Ivy
>>>>> 
>>>>> There are also a number of other open-source projects with various
>>>>> licenses that the project depends on both dynamically (runtime), and
>>>>> statically.
>>>>> 
>>>>> === GNU GPL 2 ===
>>>>> * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
>>>>> 
>>>>> === LGPL 2.1 ===
>>>>> * KenLM: github.com/kpu/kenlm
>>>>> 
>>>>> === Apache 2.0 ===
>>>>> * BerkeleyLM: https://code.google.com/p/berkeleylm/
>>>>> 
>>>>> === GNU GPL ===
>>>>> * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
>>>>> 
>>>>> == Required Resources ==
>>>>> * Mailing Lists
>>>>>   * private@joshua.incubator.apache.org
>>>>>   * dev@joshua.incubator.apache.org
>>>>>   * commits@joshua.incubator.apache.org
>>>>> 
>>>>> * Git Repos
>>>>>   * https://git-wip-us.apache.org/repos/asf/joshua.git
>>>>> 
>>>>> * Issue Tracking
>>>>>   * JIRA Joshua (JOSHUA)
>>>>> 
>>>>> * Continuous Integration
>>>>>   * Jenkins builds on https://builds.apache.org/
>>>>> 
>>>>> * Web
>>>>>   * http://joshua.incubator.apache.org/
>>>>>   * wiki at http://cwiki.apache.org
>>>>> 
>>>>> == Initial Committers ==
>>>>> The following is a list of the planned initial Apache committers (the
>>>>> active subset of the committers for the current repository on Github).
>>>>> 
>>>>> * Matt Post (post@cs.jhu.edu)
>>>>> * Lewis John McGibbney (lewismc@apache.org)
>>>>> * Chris Mattmann (mattmann@apache.org)
>>>>> 
>>>>> == Affiliations ==
>>>>> 
>>>>> * Johns Hopkins University
>>>>>   * Matt Post
>>>>> 
>>>>> * NASA JPL
>>>>>   * Chris Mattmann
>>>>>   * Lewis John McGibbney
>>>>> 
>>>>> 
>>>>> == Sponsors ==
>>>>> === Champion ===
>>>>> * Chris Mattmann (NASA/JPL)
>>>>> 
>>>>> === Nominated Mentors ===
>>>>> * Paul Ramirez
>>>>> * Lewis John McGibbney
>>>>> * Chris Mattmann
>>>>> 
>>>>> == Sponsoring Entity ==
>>>>> The Apache Incubator
>>>>> 
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
>>>> For additional commands, e-mail: general-help@incubator.apache.org
>>> 
>>> 
>> 
>> 



