opennlp-dev mailing list archives

From Grant Ingersoll
Subject Re: Getting our first release out
Date Tue, 01 Feb 2011 21:45:58 GMT

On Feb 1, 2011, at 11:20 AM, Benson Margulies wrote:

> With somewhat mixed feelings, I've been following this discussion. In
> the interests of full disclosure, I'll explain the mixed feelings in a
> moment. I warmed up legal-discuss for you during the incubator
> discussion and learned some things.

What's the thread for this one?

> Based on my legal understanding, I feel fairly confident that models
> derived from textual corpora are not 'derived works' subject to the
> copyrights and licenses of the corpora. However, IANAL, and this needs
> to be explored. Some remarks on legal-discuss suggest that, in Europe,
> I may be completely wrong. Still, this is probably the *good* news.
> The less-good news is that, as a general principle, the ASF would not
> want a release to contain a binary artifact derived from sources that
> cannot be released under the Apache license, or even obtained under
> the Apache license or something remotely like it. An even stronger
> principle is that the source materials must be available, period
> (e.g. not available only to LDC members or something).

This is the single most frustrating issue facing open source text tools to date.  It's why
I started the Open Relevance Project, but until we have enough of us willing to band together
and work on it, we will be stuck.

> The less bad news is that there is a precedent here: SpamAssassin. To
> train spam models, SpamAssassin has to collect and maintain large
> collections of materials that have restrictive licenses. The
> Foundation has decided that this is tolerable if these materials are
> kept on a Foundation server, and access to that granted to legitimate
> members of the development community, one by one. This avoids the
> spectre of 'publication' but permits open participation.

This is OK, but it discourages newbies from participating.

> The bottom line of the legal-discuss discussion was that this path
> was, broadly, available to OpenNLP. However, legal-discuss hates to
> discuss hypotheticals, so you won't get a definitive ruling until you
> ask a specific question. I recommend opening a JIRA on legal-discuss
> as a way to clarify that you need a clear and definitive ruling and
> not just an email food-fight.

Yes, we should start assembling a list of corpora, if only so that we at least have it for others
who come later and want to reproduce the models.  In the meantime, I would agree that we can just
keep the models elsewhere.  We don't have to provide models.  They are a convenience for all
involved, but not a requirement in order to run.  I wonder how many people actually train
their own.  (BTW, we should update our website to point to older models, too.  They are really
hard to find unless you do some URL rewriting.)
