opennlp-dev mailing list archives

From Joern Kottmann <>
Subject Re: Automated testing with public data
Date Wed, 15 Apr 2015 08:23:34 GMT
By publicly accessible data I mean a corpus you can somehow acquire,
as opposed to data you create on your own for a project.

All the corpora we support in the formats package are publicly accessible.
Some you have to buy, and for others you just have to sign an agreement.

A very interesting corpus for testing (and training models on) is OntoNotes.

Here is a link to the LDC entry:

You can get it for free (or for a small distribution fee) but you can't
just download it.
It would be great if the ASF could acquire this data set so we can share it
among the committers.

Is that what you mean by proprietary data?


On Wed, Apr 15, 2015 at 10:05 AM, Richard Eckart de Castilho <> wrote:

> On 15.04.2015, at 09:39, Joern Kottmann <> wrote:
> > Some data sets are publicly available but protected by copyright and just
> > can't be redistributed in any way. For this data we could get/buy a
> > license and maybe restrict access to it among the committers.
> That's what I'm saying ;) If you automatically download the data to a
> personal workstation during tests, you do not redistribute the data.
> For Jenkins builds, I just checked the Apache Jenkins and the "Workspace"
> does not seem to be publicly accessible. So stuff downloaded during tests
> there is also not made publicly available (redistributed) - it is only
> accessible to Apache developers that are logged in.
> IMHO only truly proprietary data that is not publicly accessible should be
> a problem, no?
> -- Richard
