incubator-general mailing list archives

From Matt Post <p...@cs.jhu.edu>
Subject Re: [DISCUSS] Apache Joshua Incubator Proposal - Machine Translation Toolkit
Date Wed, 20 Jan 2016 15:51:12 GMT
I imagine so. Model building is highly technical and resource-intensive, and only a
few people will want or need to do it. Working on and running the decoder (#2) should be
the much more common use case, and with the (included, Apache-licensed) BerkeleyLM, that
can be done without any external dependencies.


> On Jan 20, 2016, at 10:46 AM, Alex Harui <aharui@adobe.com> wrote:
> 
> External is good news.  I'm not sure how much leeway there is in the
> following quote from [1], but what percentage of your users are currently
> using an all-ASF-compatible set of projects?
> 
>    The question to ask yourself in this situation is:
>        * "Will the majority of users want to use my
>           product without adding the optional components?"
> 
> -Alex
> 
> [1] http://www.apache.org/legal/resolved.html
> 
> 
> On 1/20/16, 7:17 AM, "Matt Post" <post@cs.jhu.edu> wrote:
> 
>> The dependencies can be split into two kinds: ones required for building
>> new models, and ones needed by the decoder to translate new sentences
>> with a pre-built model (i.e., black-box translation with the language
>> packs).
>> 
>> 1. For building new models, you need a way to align the words between
>> sentences in parallel text. Both the aligners used by Joshua (GIZA++ and
>> the Berkeley aligner) are GPL of some form. These can be implemented as
>> external dependencies, or can be replaced with another aligner, like
>> fast_align (https://github.com/clab/fast_align), which is
>> Apache-licensed. There are many other options, in fact. So this should
>> not be a worry.
>> 
>> 2. For doing black-box translation, one needs to represent the language
>> model, which is very large. The best tool for this is KenLM
>> (github.com/kpu/kenlm), which is LGPL 2.1. There is also BerkeleyLM,
>> which is just as good for practical purposes and is Apache-licensed.
>> KenLM is C++ and is loaded via the JNI, whereas BerkeleyLM is written in
>> Java. I have moved to including BerkeleyLM in language packs, because I
>> can then include the Joshua-runtime, and people can translate without
>> even having to compile anything.
>> 
>> So in short, there are no hard dependencies on unfavorably-licensed
>> external projects.
>> 
>> matt
>> 
>> 
>> 
>> 
>>> On Jan 20, 2016, at 10:08 AM, Mattmann, Chris A (3980)
>>> <chris.a.mattmann@jpl.nasa.gov> wrote:
>>> 
>>> Hey Hen,
>>> 
>>> Matt Post, who I believe is monitoring this list and who has
>>> been one of the key Joshua developers, and I have discussed this,
>>> and we believe that the GPL/LGPL dependencies can potentially:
>>> 
>>> 1. be replaced with category-A or category-B alternatives. Matt
>>> mentioned one already to me which has slipped my mind.
>>> 2. be made in such a way that they are external tools and the
>>> bindings exist in Joshua to call those external tools (aka runtime
>>> deps akin to depending on a C compiler, etc.)
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattmann@nasa.gov
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

