opennlp-dev mailing list archives

From Jörn Kottmann <>
Subject Re: OpenNLP Annotations Proposal
Date Thu, 23 Jun 2011 07:29:24 GMT
On 6/22/11 7:53 PM, Olivier Grisel wrote:
>> We can also fix by having an option to delete "garbage" texts from the
>> corpus.
> Yes, discarding a whole CAS. But if the CAS is document level instead
> of sentence level, that might be an issue.

It depends. If the whole article is in such bad condition that annotating
it does not make sense, it should be discarded. If only a small part of the
article cannot be annotated, the annotator can skip over that part.
>> What other kind of data do you think we should store outside the CASes?
> If we ignore the Sofa editing use case, probably nothing.
+1, let's do that for now.

>>> Also do you know of a good database for storing CAS? For instance does
>>> there exist an Apache CouchDB CASConsumer + CollectionReader? Or maybe
>>> a JDBC CASConsumer + CollectionReader that we could use with Apache
>>> Derby for instance?
>> I did a couple of tests with HBase and it was very easy to store 100M of
>> CASes,
>> anyway we do not really need to scale to such huge amounts, so I believe a
>> NoSQL or relational database would be just fine.
> I am -1 for HBase as it requires setting up a Hadoop cluster to run. As
> we target human annotators, we won't have terabytes of text data
> anyway and all data will probably fit in memory in most cases. I was
> thinking about using a DB to be able to handle concurrent editing by
> several annotators (+ ability to do search in the Sofa content) in a
> simple way.

Yeah, it does not seem important which DB we use, since most will
just work well for us.

I believe concurrent editing is more a question of the data model we choose,
and to support search I would use something Lucene-based instead of the
built-in search some DBs might have.

For training it is also important that we can iterate
over all items in a reasonable time.

I actually like Bigtable's column family model because
it is easy to store a sofa plus its feature structures in the columns,
iterating is fast, and it can be scaled to huge amounts of data if needed.
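
To make the column family idea a bit more concrete, here is a rough sketch in
Python. It is only a toy in-memory model, not the HBase or Bigtable API: the
class name ColumnFamilyStore, the family names "sofa" and "fs", and the
"Type:begin-end" column qualifiers are all assumptions made up for illustration.

```python
from collections import defaultdict


class ColumnFamilyStore:
    """Toy in-memory model of a column-family table (hypothetical, not HBase)."""

    def __init__(self):
        # row key -> column family -> column qualifier -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value):
        self.rows[row_key][family][qualifier] = value

    def scan(self, family):
        """Yield (row_key, columns) for one family, in row-key order."""
        for row_key in sorted(self.rows):
            if family in self.rows[row_key]:
                yield row_key, self.rows[row_key][family]


store = ColumnFamilyStore()

# One row per document: the sofa text lives in one family,
# the feature structures (here: entity spans) in another.
store.put("doc-0001", "sofa", "text", "Jörn lives in Berlin.")
store.put("doc-0001", "fs", "Person:0-4", "Jörn")
store.put("doc-0001", "fs", "Location:14-20", "Berlin")
store.put("doc-0002", "sofa", "text", "Olivier works in Paris.")

# Iterating for training only needs to touch the families it cares about.
for row_key, columns in store.scan("sofa"):
    print(row_key, columns["text"])
```

Keeping the sofa and the feature structures in separate families means a
training run can scan just what it needs, row by row.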

Anyway, maybe it would be good to start with Derby and just store XMI
files in it. What do you think?
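
As a rough idea of how small such a store could be, here is a sketch in
Python. Note that it uses the standard library's sqlite3 as a stand-in for
Derby, so it is not what we would ship. The table name cas_store, the function
names, and the placeholder XMI strings are assumptions for illustration only.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a one-table store: document id -> serialized XMI."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cas_store ("
        "doc_id TEXT PRIMARY KEY, xmi TEXT NOT NULL)"
    )
    return conn


def save_xmi(conn, doc_id, xmi):
    """Insert or overwrite the XMI for one document."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO cas_store (doc_id, xmi) VALUES (?, ?)",
            (doc_id, xmi),
        )


def discard(conn, doc_id):
    """Drop a whole document, e.g. one that is too garbled to annotate."""
    with conn:
        conn.execute("DELETE FROM cas_store WHERE doc_id = ?", (doc_id,))


def iter_xmi(conn):
    """Iterate over all stored documents in id order, e.g. for training."""
    cursor = conn.execute("SELECT doc_id, xmi FROM cas_store ORDER BY doc_id")
    for doc_id, xmi in cursor:
        yield doc_id, xmi


conn = open_store()
# Placeholder strings; real entries would be full XMI serializations of a CAS.
save_xmi(conn, "doc-0001", "<xmi:XMI>first document</xmi:XMI>")
save_xmi(conn, "doc-0002", "<xmi:XMI>garbage text</xmi:XMI>")
save_xmi(conn, "doc-0003", "<xmi:XMI>third document</xmi:XMI>")
discard(conn, "doc-0002")

remaining = [doc_id for doc_id, _ in iter_xmi(conn)]
```

The same three operations (save, discard, iterate) should map onto a plain
JDBC table in Derby without much trouble.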

