opennlp-dev mailing list archives

From Jörn Kottmann <kottm...@gmail.com>
Subject Re: OpenNLP Annotations Proposal
Date Wed, 22 Jun 2011 16:45:13 GMT
Any other opinions on how we should store/exchange our
text with annotations?

As proposed up to now:
1. UIMA CAS based approach
2. Custom solution as proposed by Olivier

I think we should reach consensus here quickly
so we can start extending the proposal.

If there are no objections, I suggest that we include
the Corpus Refiner in the proposal as a web-based tool
to update/verify/annotate a corpus.

Jörn

On 6/22/11 11:38 AM, Olivier Grisel wrote:
> 2011/6/22 Jörn Kottmann<kottmann@gmail.com>:
>> On 6/22/11 10:45 AM, Olivier Grisel wrote:
>>> I find the UIMA CAS API much more complicated to work with than
>>> working directly with token-level concepts in the OpenNLP API (i.e.
>>> with arrays of Span). I haven't had a look at the opennlp-uima
>>> subproject though: you probably already have tooling and predefined
>>> type systems that make interoperability with CAS instances less of a
>>> pain.
>> If you look at annotation tools, they usually give the user some
>> flexibility in terms of what kind of annotations they are allowed to add.
>> One thing I always see is that as soon as they allow more complex
>> annotations, the tools and the code which handles the annotations also
>> get complex. Have a look at WordFreak or GATE.
>>
>> The CAS might be difficult to use at first, but at least it works and is
>> very well tested. If we create a custom solution we might end up with
>> a similar complexity anyway.
>>
>> We would need to define a type system, but that is something we need
>> to do anyway independent of which way we implement it.
>> Maybe we even need to support different type systems for different corpora.
>> I guess we start with Wikipedia-based data, but one day we might want to
>> annotate an email or blog corpus.
>>
>> It is an interesting question how the type system should look, since we
>> need to track where the annotations come from, and might even want some
>> to be double-checked, or need to annotate the disagreement of annotators.
> Point taken.
>

