opennlp-dev mailing list archives

From Jörn Kottmann <>
Subject Re: Coref problem
Date Thu, 17 Nov 2011 10:48:11 GMT
On 11/17/11 11:32 AM, Aliaksandr Autayeu wrote:
>> We shouldn't replace JWNL with a newer version,
>> because we currently don't have the ability to train
>> or evaluate the coref component.
> +1. Having test coverage eases many things, refactoring and development
> included :)
>> This is a big issue for us because it also blocks
>> other changes and updates to the code itself,
>> e.g. the cleanups Aliaksandr contributed.
>> What we need here is a plan for how we can get the coref component
>> into a state which makes it possible to develop it in a community.
>> If we don't find a way to resolve this, I think we should move the coref
>> stuff to the sandbox and leave it there until we have some training data.
> In my experience, doing things like this is almost equivalent to deleting
> the piece of code altogether. On the other hand, if there is no developer,
> actively using and developing this piece, having corpora, tests, etc,
> others might not have enough incentives.

That is already the situation: the developer who wrote it no longer
supports it. The only way to bring it back to life is to get training and
evaluation running. Once we have that, it will be possible to continue
working on it, and people can start using it. The code itself is easy to
understand, and I have a good idea of how it works.

In its current state it really blocks the development of a few things.

>> Another option would be to label enough wikinews data, so we are able to
>> train it.
> How much exactly is this "enough"? And what's the annotation UI? This also
> might be a good option to improve the annotation tools. I might be
> interested in pursuing this option (only if the corpus produced will be
> under a free license), mainly to learn :) but I would need some help and
> supervision.

We are discussing doing a wikinews crowd-sourcing project to label
training data for all components in OpenNLP.

I once wrote a proposal to communicate this idea:

Currently we have a first version of the Corpus Server, plugins for the
UIMA Cas Editor (an annotation tool) to access articles in the Corpus
Server, and an OpenNLP plugin which can help with sentence detection,
tokenization and NER (and could be extended with coref support).

These tools are all located in the sandbox.

I am currently using them to run a private annotation project, and
therefore have time to work on them.

