www-legal-discuss mailing list archives

From "Benson Margulies (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LEGAL-157) May we include models without release the data the models were built from
Date Wed, 23 Jan 2013 23:45:12 GMT

    [ https://issues.apache.org/jira/browse/LEGAL-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13561237#comment-13561237 ]

Benson Margulies commented on LEGAL-157:

Roy, I wish you would help me apply your writing to the specific set of situations at hand
here. All of these follow a pattern, and I think it's a pattern we're going to see, over and
over. And over at the Incubator I want to have this completely clear to people before they
launch podlings.

First we start with a body of material which is absolutely not Apache product. In the SpamAssassin
case, a curated collection of spam and ham. In the OpenNLP case, (e.g.) a ton of copyrighted
news articles hoovered up from the web. In the cTAKES case, some medical records or the like.

Second, some people annotate that data. The annotation work might be Apache work or it may
not, depending. At SpamAssassin, the annotation (classification as spam or ham) is definitely
Apache work.

Third, software that is granted to or created at Apache is applied to this annotated data
to build a model. Frequently, the immediate result of this is something sort of readable --
a big file of numbers -- which is, in the end, compiled into a binary.

Finally, to get useful work out of the whole business, an end user needs to feed this model
to the software from the project.

As I read your email, you'd keep all of the model pipeline outside of Apache, or at least
out of Apache primary svn and releases. That leaves me with two questions for you.

1) Another theme I've seen is stern remarks directed at Apache projects coordinating, storing,
or otherwise operating *outside of Apache*. This seems to produce a catch-22: An Apache project
can't undertake something like an NLP model inside, and they also can't undertake it outside.
I've felt that it should be OK for the members of an Apache project to organize something
like this as long as they make it clear that the results aren't an Apache result.

2) How do you analyze the SpamAssassin situation, in which the spam/ham collection lives on
Apache infrastructure, but not svn, with access limited to community members? By analogy to
build/test tools and other things that aren't in primary svn?

> May we include models without release the data the models were built from
> -------------------------------------------------------------------------
>                 Key: LEGAL-157
>                 URL: https://issues.apache.org/jira/browse/LEGAL-157
>             Project: Legal Discuss
>          Issue Type: Question
>            Reporter: James Joseph Masanz
> We would like to include with cTAKES machine learning models built from data that is
> not publicly available.
> The models are contributed to Apache.
> The corpus of data used to build the models is not contributed to Apache.
> Can we include such models in a convenience binary?
> http://s.apache.org/cTAKES-models-Q012013
> - James Masanz
