incubator-general mailing list archives

From Henri Yandell <bay...@apache.org>
Subject Re: [VOTE] Accept Joshua as an Apache Incubator Podling
Date Sun, 31 Jan 2016 04:58:27 GMT
+1 (non-binding).

On Sat, Jan 30, 2016 at 5:45 PM, Luke Han <luke.hq@gmail.com> wrote:

> +1 non-binding
>
>
> Best Regards!
> ---------------------
>
> Luke Han
>
> On Sun, Jan 31, 2016 at 5:27 AM, Tom Barber <tom.barber@meteorite.bi>
> wrote:
>
> > +1 binding
> >
> > Should be a very interesting project!
> >
> > On Sat, Jan 30, 2016 at 8:05 PM, Ashish <paliwalashish@gmail.com> wrote:
> >
> > > + (non-binding)
> > >
> > > On Sat, Jan 30, 2016 at 12:00 PM, Mattmann, Chris A (3980)
> > > <chris.a.mattmann@jpl.nasa.gov> wrote:
> > > > Hi Everyone,
> > > >
> > > > OK the discussion is now completed. Please VOTE to accept Joshua
> > > > into the Apache Incubator. I’ll leave the VOTE open for at least
> > > > the next 72 hours, with hopes to close it next Friday the 5th of
> > > > February, 2016.
> > > >
> > > > [ ] +1 Accept Joshua as an Apache Incubator podling.
> > > > [ ] +0 Abstain.
> > > > [ ] -1 Don’t accept Joshua as an Apache Incubator podling because..
> > > >
> > > > Of course, I am +1 on this. Please note VOTEs from Incubator PMC
> > > > members are binding but all are welcome to VOTE!
> > > >
> > > > Cheers,
> > > > Chris
> > > >
> > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > Chris Mattmann, Ph.D.
> > > > Chief Architect
> > > > Instrument Software and Science Data Systems Section (398)
> > > > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > > Office: 168-519, Mailstop: 168-527
> > > > Email: chris.a.mattmann@nasa.gov
> > > > WWW:  http://sunset.usc.edu/~mattmann/
> > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > Adjunct Associate Professor, Computer Science Department
> > > > University of Southern California, Los Angeles, CA 90089 USA
> > > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: jpluser <chris.a.mattmann@jpl.nasa.gov>
> > > > Date: Tuesday, January 12, 2016 at 10:56 PM
> > > > To: "general@incubator.apache.org" <general@incubator.apache.org>
> > > > Cc: "post@cs.jhu.edu" <post@cs.jhu.edu>
> > > > > Subject: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
> > > >
> > > >>Hi Everyone,
> > > >>
> > > > >>Please find attached for your viewing pleasure a proposed new project,
> > > > >>Apache Joshua, a statistical machine translation toolkit. The proposal
> > > > >>is in wiki draft form at: https://wiki.apache.org/incubator/JoshuaProposal
> > > >>
> > > > >>Proposal text is copied below. I’ll leave the discussion open for a week
> > > > >>and we are interested in folks who would like to be initial committers
> > > > >>and mentors. Please discuss here on the thread.
> > > >>
> > > >>Thanks!
> > > >>
> > > >>Cheers,
> > > >>Chris (Champion)
> > > >>
> > > >>———
> > > >>
> > > >>= Joshua Proposal =
> > > >>
> > > >>== Abstract ==
> > > >>[[joshua-decoder.org|Joshua]] is an open-source statistical machine
> > > > >>translation toolkit. It includes a Java-based decoder for translating with
> > > > >>phrase-based, hierarchical, and syntax-based translation models, a
> > > > >>Hadoop-based grammar extractor (Thrax), and an extensive set of tools and
> > > > >>scripts for training and evaluating new models from parallel text.
> > > >>
> > > >>== Proposal ==
> > > > >>Joshua is a state-of-the-art statistical machine translation system that
> > > > >>provides a number of features:
> > > >>
> > > > >> * Support for the two main paradigms in statistical machine translation:
> > > > >>phrase-based and hierarchical / syntactic.
> > > > >> * A sparse feature API that makes it easy to add new feature templates
> > > > >>supporting millions of features
> > > > >> * Native implementations of many tuners (MERT, MIRA, PRO, and AdaGrad)
> > > > >> * Support for lattice decoding, allowing upstream NLP tools to expose
> > > > >>their hypothesis space to the MT system
> > > > >> * An efficient representation for models, allowing for quick loading of
> > > > >>multi-gigabyte model files
> > > > >> * Fast decoding speed (on par with Moses and mtplz)
> > > > >> * Language packs: precompiled models that allow the decoder to be run as
> > > > >>a black box
> > > > >> * Thrax, a Hadoop-based tool for learning translation models from
> > > > >>parallel text
> > > > >> * A suite of tools for constructing new models for any language pair for
> > > > >>which sufficient training data exists
> > > >>
> > > >>== Background and Rationale ==
> > > > >>A number of factors make this a good time for an Apache project focused on
> > > > >>machine translation (MT): the quality of MT output (for many language
> > > > >>pairs); the average computing resources available on computers, relative
> > > > >>to the needs of MT systems; and the availability of a number of
> > > > >>high-quality toolkits, together with a large base of researchers working
> > > > >>on them.
> > > >>
> > > > >>Over the past decade, machine translation (MT; the automatic translation
> > > > >>of one human language to another) has become a reality. The research into
> > > > >>statistical approaches to translation that began in the early nineties,
> > > > >>together with the availability of large amounts of training data, and
> > > > >>better computing infrastructure, have all come together to produce
> > > > >>translation results that are “good enough” for a large set of language
> > > > >>pairs and use cases. Free services like
> > > > >>[[https://www.bing.com/translator|Bing Translator]] and
> > > > >>[[https://translate.google.com|Google Translate]] have made these services
> > > > >>available to the average person through direct interfaces and through
> > > > >>tools like browser plugins, and sites across the world with higher
> > > > >>translation needs use them to translate their pages automatically.
> > > >>
> > > > >>MT does not require the infrastructure of large corporations in order to
> > > > >>produce feasible output. Machine translation can be resource-intensive,
> > > > >>but need not be prohibitively so. Disk and memory usage are mostly a
> > > > >>matter of model size, which for most language pairs is a few gigabytes at
> > > > >>most, at which size models can provide coverage on the order of tens or
> > > > >>even hundreds of thousands of words in the input and output languages. The
> > > > >>computational complexity of the algorithms used to search for translations
> > > > >>of new sentences is typically linear in the number of words in the input
> > > > >>sentence, making it possible to run a translation engine on a personal
> > > > >>computer.
> > > >>
> > > > >>The research community has produced many different open source translation
> > > > >>projects for a range of programming languages and under a variety of
> > > > >>licenses. These projects include the core “decoder”, which takes a model
> > > > >>and uses it to translate new sentences between the language pair the model
> > > > >>was defined for. They also typically include a large set of tools that
> > > > >>enable new models to be built from large sets of example translations
> > > > >>(“parallel data”) and monolingual texts. These toolkits are usually built
> > > > >>to support the agendas of the (largely) academic researchers that build
> > > > >>them: the repeated cycle of building new models, tuning model parameters
> > > > >>against development data, and evaluating them against held-out test data,
> > > > >>using standard metrics for testing the quality of MT output.
> > > >>
> > > > >>Together, these three factors (the quality of machine translation output,
> > > > >>the feasibility of translating on standard computers, and the availability
> > > > >>of tools to build models) make it reasonable for the end users to use MT as
> > > > >>a black-box service, and to run it on their personal machine.
> > > >>
> > > > >>These factors make it a good time for an organization with the status of
> > > > >>the Apache Foundation to host a machine translation project.
> > > >>
> > > >>== Current Status ==
> > > > >>Joshua was originally ported from David Chiang’s Python implementation of
> > > > >>Hiero by Zhifei Li, while he was a Ph.D. student at Johns Hopkins
> > > > >>University. The current version is maintained by Matt Post at Johns
> > > > >>Hopkins’ Human Language Technology Center of Excellence. Joshua has made
> > > > >>many releases with a list of over 20 source code tags. The last release of
> > > > >>Joshua was 6.0.5 on November 5th, 2015.
> > > >>
> > > >>== Meritocracy ==
> > > >>The current developers are familiar with meritocratic open source
> > > > >>development at Apache. Apache was chosen specifically because we want to
> > > > >>encourage this style of development for the project.
> > > >>
> > > >>== Community ==
> > > >>Joshua is used widely across the world. Perhaps its biggest (known)
> > > > >>research / industrial user is the Amazon research group in Berlin. Another
> > > > >>user is the US Army Research Lab. No formal census has been undertaken,
> > > > >>but posts to the Joshua technical support mailing list, along with the
> > > > >>occasional contributions, suggest small research and academic communities
> > > > >>spread across the world, many of them in India.
> > > >>
> > > > >>During incubation, we will explicitly seek to increase our usage across
> > > > >>the board, including academic research, industry, and other end users
> > > > >>interested in statistical machine translation.
> > > >>
> > > >>== Core Developers ==
> > > > >>The current set of core developers is fairly small, having fallen with the
> > > > >>graduation from Johns Hopkins of some core student participants. However,
> > > > >>Joshua is used fairly widely, as mentioned above, and there remains a
> > > > >>commitment from the principal researcher at Johns Hopkins to continue to
> > > > >>use and develop it. Joshua has seen a number of new community members
> > > > >>become interested recently due to a potential for its projected use in a
> > > > >>number of ongoing DARPA projects such as XDATA and Memex.
> > > >>
> > > >>== Alignment ==
> > > > >>Joshua is currently Copyright (c) 2015, Johns Hopkins University, All
> > > > >>rights reserved, and licensed under the BSD 2-clause license. It would of
> > > > >>course be the intention to relicense this code under AL2.0, which would
> > > > >>permit expanded and increased use of the software within Apache projects.
> > > > >>There is currently an ongoing effort within the Apache Tika community to
> > > > >>utilize Joshua within Tika’s Translate API, see
> > > > >>[[https://issues.apache.org/jira/browse/TIKA-1343|TIKA-1343]].
> > > >>
> > > >>== Known Risks ==
> > > >>
> > > >>=== Orphaned products ===
> > > > >>At the moment, regular contributions are made by a single contributor, the
> > > > >>lead maintainer. He (Matt Post) plans to continue development for the next
> > > > >>few years, but it is still a single point of failure, since the graduate
> > > > >>students who worked on the project have moved on to jobs, mostly in
> > > > >>industry. However, our goal is to help that process by growing the
> > > > >>community in Apache, and at least in growing the community with users and
> > > > >>participants from NASA JPL.
> > > >>
> > > >>=== Inexperience with Open Source ===
> > > > >>The team both at Johns Hopkins and NASA JPL have experience with many OSS
> > > > >>software projects at Apache and elsewhere. We understand "how it works"
> > > > >>here at the foundation.
> > > >>
> > > >>
> > > >>== Relationships with Other Apache Products ==
> > > > >>Joshua includes dependencies on Hadoop, and also is included as a plugin in
> > > > >>Apache Tika. We are also interested in coordinating with other projects
> > > > >>including Spark, and other projects needing MT services for language
> > > > >>translation.
> > > >>
> > > >>== Developers ==
> > > > >>Joshua only has one regular developer who is employed by Johns Hopkins
> > > > >>University. NASA JPL (Mattmann and McGibbney) have been contributing
> > > > >>lately including a Brew formula and other contributions to the project
> > > > >>through the DARPA XDATA and Memex programs.
> > > >>
> > > >>== Documentation ==
> > > >>Documentation and publications related to Joshua can be found at
> > > > >>joshua-decoder.org. The source for the Joshua documentation is currently
> > > > >>hosted on Github at
> > > >>https://github.com/joshua-decoder/joshua-decoder.github.com
> > > >>
> > > >>== Initial Source ==
> > > > >>Current source resides at Github: github.com/joshua-decoder/joshua (the
> > > > >>main decoder and toolkit) and github.com/joshua-decoder/thrax (the grammar
> > > > >>extraction tool).
> > > >>
> > > >>== External Dependencies ==
> > > > >>Joshua has a number of external dependencies. Only BerkeleyLM (Apache 2.0)
> > > > >>and KenLM (LGPL 2.1) are run-time decoder dependencies (one of which is
> > > > >>needed for translating sentences with pre-built models). The rest are
> > > > >>dependencies for the build system and pipeline, used for constructing and
> > > > >>training new models from parallel text.
> > > >>
> > > >>Apache projects:
> > > >> * Ant
> > > >> * Hadoop
> > > >> * Commons
> > > >> * Maven
> > > >> * Ivy
> > > >>
> > > >>There are also a number of other open-source projects with various
> > > >>licenses that the project depends on both dynamically (runtime), and
> > > >>statically.
> > > >>
> > > >>=== GNU GPL 2 ===
> > > >> * Berkeley Aligner: https://code.google.com/p/berkeleyaligner/
> > > >>
> > > > >>=== LGPL 2.1 ===
> > > >> * KenLM: github.com/kpu/kenlm
> > > >>
> > > >>=== Apache 2.0 ===
> > > >> * BerkeleyLM: https://code.google.com/p/berkeleylm/
> > > >>
> > > >>=== GNU GPL ===
> > > >> * GIZA++: http://www.statmt.org/moses/giza/GIZA++.html
> > > >>
> > > >>== Required Resources ==
> > > >> * Mailing Lists
> > > >>   * private@joshua.incubator.apache.org
> > > >>   * dev@joshua.incubator.apache.org
> > > >>   * commits@joshua.incubator.apache.org
> > > >>
> > > >> * Git Repos
> > > >>   * https://git-wip-us.apache.org/repos/asf/joshua.git
> > > >>
> > > >> * Issue Tracking
> > > >>   * JIRA Joshua (JOSHUA)
> > > >>
> > > >> * Continuous Integration
> > > >>   * Jenkins builds on https://builds.apache.org/
> > > >>
> > > >> * Web
> > > >>   * http://joshua.incubator.apache.org/
> > > >>   * wiki at http://cwiki.apache.org
> > > >>
> > > >>== Initial Committers ==
> > > >>The following is a list of the planned initial Apache committers (the
> > > > >>active subset of the committers for the current repository on Github).
> > > >>
> > > >> * Matt Post (post@cs.jhu.edu)
> > > >> * Lewis John McGibbney (lewismc@apache.org)
> > > >> * Chris Mattmann (mattmann@apache.org)
> > > >>
> > > >>== Affiliations ==
> > > >>
> > > >> * Johns Hopkins University
> > > >>   * Matt Post
> > > >>
> > > >> * NASA JPL
> > > >>   * Chris Mattmann
> > > >>   * Lewis John McGibbney
> > > >>
> > > >>
> > > >>== Sponsors ==
> > > >>=== Champion ===
> > > >> * Chris Mattmann (NASA/JPL)
> > > >>
> > > >>=== Nominated Mentors ===
> > > >> * Paul Ramirez
> > > >> * Lewis John McGibbney
> > > >> * Chris Mattmann
> > > >>
> > > >>== Sponsoring Entity ==
> > > >>The Apache Incubator
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >>Chris Mattmann, Ph.D.
> > > >>Chief Architect
> > > >>Instrument Software and Science Data Systems Section (398)
> > > >>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > > >>Office: 168-519, Mailstop: 168-527
> > > >>Email: chris.a.mattmann@nasa.gov
> > > >>WWW:  http://sunset.usc.edu/~mattmann/
> > > >>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >>Adjunct Associate Professor, Computer Science Department
> > > >>University of Southern California, Los Angeles, CA 90089 USA
> > > >>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >>
> > > >>
> > > >>
> > > >
> > >
> > >
> > >
> > > --
> > > thanks
> > > ashish
> > >
> > > Blog: http://www.ashishpaliwal.com/blog
> > > My Photo Galleries: http://www.pbase.com/ashishpaliwal
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> > > For additional commands, e-mail: general-help@incubator.apache.org
> > >
> > >
> >
>
