www-legal-discuss mailing list archives

From "Upayavira" ...@odoko.co.uk>
Subject Re: Fair-use data in svn
Date Fri, 05 Nov 2010 08:41:10 GMT
I think this might be a way to go. Ask here for folks that can
put you in touch with (hopefully) representative copyright
holders, and get grants for licenses to old content.

Now, if you plan to include it in SVN, use it for tests, and not
include it in releases, then the license grants need not be
AL2.0.

Upayavira

On Fri, 05 Nov 2010 09:21 +0100, "Santiago Gala"
<santiago.gala@gmail.com> wrote:

  CNN can probably give permission to use a set of 6 month old
  "regular" news for such purpose. If contacted through their PR
  people you could 'pay' with words about how important this is,
  or a joint release about them helping research (but talk with
  press@ before assuming it)

On 05/11/2010 01:08, "Benson Margulies"
<[1]bimargulies@gmail.com> wrote:
> On Thu, Nov 4, 2010 at 7:47 PM, Lawrence Rosen
> <[2]lrosen@rosenlaw.com> wrote:
>> Benson, how about copying materials that are explicitly marked
>> "Creative Commons"? There must be enough of that stuff on the web
>> to collect into a test case.
>
> Here's a concrete example. Let's say that the job at hand is to
> extract useful text from webpages. You need to test this on the news
> sites that people want to work with, like CNN. The inventory of
> 'Commons' pages is not representative.
>
> Another bit of concretude:
>
> Case 1: you have a representative collection of HTML pages, and you
> use them to regress data extraction. Tika has avoided this by
> depending on a non-ASF component (boilerpipe).
>
> Case 2: you have, oh, 250,000 words of news, and you get people to
> annotate them, and use them to train models. Whether there's enough of
> the right stuff out there under CC is an open question.
>
>>
>> /Larry
>>
>>
>>
>>
>>> -----Original Message-----
>>> From: Benson Margulies [mailto:[3]bimargulies@gmail.com]
>>> Sent: Thursday, November 04, 2010 2:56 PM
>>> To: [4]legal-discuss@apache.org
>>> Subject: Re: Fair-use data in svn
>>>
>>> > There is no exception in copyright infringement law that allows you
>>> > to copy other people's copyrighted materials and distribute them on an
>>> > Apache website, no matter how upstanding the goals, without a license.
>>> > Ask permission first.
>>>
>>> It won't be on an apache web site. It will be in a zip file in svn,
>>> read by (for example) a unit test. That seems a relevant distinction
>>> to me, but YAAL, not me.
>>>
>>> >
>>> > If you intend to rely on a fair use defense, don't count on it
>>> > without analyzing the fair use factors carefully. I'll work with you on
>>> > that analysis if you can't find a better alternative for generating
>>> > test data.
>>> >
>>> > If these really are "miscellaneous" web pages, why can't you create a
>>> > test consisting of links to the actual pages? Must you copy the pages
>>> > themselves?
>>>
>>> You can't make a repeatable process that depends on ephemeral content
>>> -- and this content is always ephemeral -- sitting there when you want
>>> it.
>>>
>>>
>>> > /Larry
>>> >
>>> >
>>> >> -----Original Message-----
>>> >> From: Benson Margulies [mailto:[5]bimargulies@gmail.com]
>>> >> Sent: Thursday, November 04, 2010 9:07 AM
>>> >> To: [6]legal-discuss@apache.org
>>> >> Subject: Fair-use data in svn
>>> >>
>>> >> I write code in some areas where 'real world' textual data is fuel.
>>> >> It's test cases. It's training corpora. It cannot be replaced by
>>> >> constructed, test-tube, text that could be created under the AL or
>>> >> some other 'class A' license.
>>> >>
>>> >> I'd like to contribute some of that data here at ASF. In some cases,
>>> >> that would require checking in test case data that consists of (for
>>> >> example) miscellaneous web pages grabbed with wget. In other cases, it
>>> >> might consist of larger collections of text derived from such pages.
>>> >>
>>> >> I would like to discover that this is acceptable, perhaps with some
>>> >> caveats and requirements for NOTICE.
>>> >>
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: [7]legal-discuss-unsubscribe@apache.org
>>> >> For additional commands, e-mail: [8]legal-discuss-help@apache.org

References

1. mailto:bimargulies@gmail.com
2. mailto:lrosen@rosenlaw.com
3. mailto:bimargulies@gmail.com
4. mailto:legal-discuss@apache.org
5. mailto:bimargulies@gmail.com
6. mailto:legal-discuss@apache.org
7. mailto:legal-discuss-unsubscribe@apache.org
8. mailto:legal-discuss-help@apache.org
9. mailto:legal-discuss-unsubscribe@apache.org
10. mailto:legal-discuss-help@apache.org
11. mailto:legal-discuss-unsubscribe@apache.org
12. mailto:legal-discuss-help@apache.org
13. mailto:legal-discuss-unsubscribe@apache.org
14. mailto:legal-discuss-help@apache.org
15. mailto:legal-discuss-unsubscribe@apache.org
16. mailto:legal-discuss-help@apache.org
