opennlp-dev mailing list archives

From Jason Baldridge
Subject Re: OpenNLP Annotations Proposal
Date Fri, 10 Jun 2011 14:52:35 GMT
This looks great! I don't have time to look at this in great detail right
now, but am happy to give feedback on particular issues and questions.

Active learning would be nice to add eventually, but it has to be done with
great care, e.g. using uncertainty alone doesn't really work that well and
care needs to be taken with class imbalance etc. Random sampling is a good
starting point, and can be used while ironing out the details.

I can't remember if this has been discussed before, but does there need to
be a non-OpenNLP group which has a primary purpose of creating open
standardized datasets and annotation interfaces, etc?

It seems also we might be able to get some corporate sponsorship for
annotation, improvements to models, creation of resources for specific
languages, etc.

BTW, there is a lot that can be done to bootstrap POS-taggers from raw data
and the tags in Wiktionary, so if folks are interested in that I'm happy to
provide pointers.


On Fri, Jun 10, 2011 at 9:12 AM, Olivier Grisel wrote:

> Hi all,
> Here is a short report of the Berlin Buzzwords Semantic / NLP
> Hackathon that happened on Wednesday and yesterday at Neofonie and was
> related to this corpus annotation project.
> Basically we worked in small groups of 2-3 people on various related topics.
> Hannes introduced an HTML / JS based tool named Walter to visualize and
> edit named entities and (optionally typed relations between those
> entities). Demo is here:
> Currently Walter works with UIMA / XMI formatted files as input /
> output, using a Java servlet deployed on a Tomcat server, for instance.
> The plan is to adapt it to a corpus annotation validation / refinement
> pattern: feed it with a partially annotated corpus coming from the
> output of an OpenNLP model pre-trained on the annotations extracted from
> Wikipedia using to bootstrap
> multilingual models.
> We would like to make a fast binary interface with keyboard shortcuts
> to focus one sentence at a time. If the user thinks that all the
> entities in the sentence are correctly annotated by the model, he/she
> presses "space", the sentence is marked as validated, and the focus moves
> to the next sentence. If the sentence is complete gibberish he/she can
> discard the sample by pressing "d". The user can also fix individual
> annotations using the mouse interface before validating the corrected
> sample.
> The up and down arrows let the user move the focus to the previous
> and next sentences (infinite AJAX / JSON scrolling over the corpus)
> without validating / discarding the current sample.
> When the focus is on a sample, the previous and next samples should be
> displayed before and after with a lower opacity level in read-only
> mode so as to provide the user with contextual information to make the
> right decision on the active sample.
> At the end of the session, the user can export all the validated
> samples as a new corpus formatted using the OpenNLP format.
> Unprocessed or explicitly discarded samples are not part of this
> refined version of the annotated corpus.
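
The validation workflow described above can be sketched as a small state
machine. This is only an illustration under stated assumptions, not the
Walter or corpus-refiner code: it is written in Python for brevity (the
planned implementation is Java), and the names Sample, RefinerSession and
handle_key are hypothetical. The export step relies on the OpenNLP name
finder training format, which puts one sentence per line and marks entities
as <START:type> ... <END>.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    VALIDATED = "validated"
    DISCARDED = "discarded"


@dataclass
class Sample:
    """One sentence with entity spans (hypothetical structure)."""
    tokens: list[str]
    # Spans are (start, end, type) in token offsets, end exclusive.
    spans: list[tuple[int, int, str]] = field(default_factory=list)
    status: Status = Status.PENDING

    def to_opennlp(self) -> str:
        """Serialize as one line of name finder training data."""
        starts = {start: etype for start, _, etype in self.spans}
        ends = {end for _, end, _ in self.spans}
        out = []
        for i, token in enumerate(self.tokens):
            if i in starts:
                out.append(f"<START:{starts[i]}>")
            out.append(token)
            if i + 1 in ends:
                out.append("<END>")
        return " ".join(out)


class RefinerSession:
    """Tracks the focused sample and the decision made on each sample."""

    def __init__(self, samples):
        self.samples = list(samples)
        self.focus = 0

    def next(self):
        self.focus = min(self.focus + 1, len(self.samples) - 1)

    def previous(self):
        self.focus = max(self.focus - 1, 0)

    def validate(self):
        self.samples[self.focus].status = Status.VALIDATED
        self.next()

    def discard(self):
        self.samples[self.focus].status = Status.DISCARDED
        self.next()

    def handle_key(self, key):
        # Key names are assumptions; unknown keys are ignored.
        actions = {" ": self.validate, "d": self.discard,
                   "up": self.previous, "down": self.next}
        action = actions.get(key)
        if action is not None:
            action()

    def export(self) -> str:
        # Pending and discarded samples are left out of the refined corpus.
        return "\n".join(s.to_opennlp() for s in self.samples
                         if s.status is Status.VALIDATED)
```

Keeping the session logic free of any servlet or browser dependency is what
would let the same class back both a web front end and a CLI variant.
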
> To implement this we plan to rewrite the server side part of Walter in
> two parts:
> 1- a set of JAX-RS resources to convert corpus items and their
> annotations between JSON objects on the client and OpenNLP NameSamples
> on the server. The first embryo for this part is here:
> 2- a POJO lib that uses OpenNLP to handle corpus loading, iterative
> validation (with validation / discarding / update + previous and next
> navigation) and serialization of the validated samples to a new
> OpenNLP formatted file that can be fed to train a new generation of
> the model. The work on this part has started here:
> Have a look at the test folder to see what's currently implemented. I
> would like to keep this in a separate maven artifact to be able to
> build a simple alternative CLI variant of the refiner interface that
> does not require starting a Jetty or Tomcat instance / browser.
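
To make the conversion in part 1 more concrete, here is a rough sketch of
turning one line of OpenNLP name finder training data into a JSON object a
browser client could consume. The JSON layout (tokens plus a list of
entities with token offsets) and the function names are assumptions made for
illustration; they are not the format Walter or the JAX-RS resources
actually use, and Python stands in for the planned Java code. Untyped
<START> tags are mapped to a placeholder type "default", which is also an
assumption.

```python
import json
import re

# Matches <START> or <START:type>.
_START = re.compile(r"<START(?::([^>]+))?>")


def parse_name_sample(line):
    """Parse '<START:type> tok tok <END> tok' into tokens and entity spans."""
    tokens, entities = [], []
    start, etype = None, None
    for piece in line.split():
        match = _START.fullmatch(piece)
        if match:
            start, etype = len(tokens), match.group(1) or "default"
        elif piece == "<END>":
            if start is None:
                raise ValueError("<END> without a matching <START>")
            # End offset is exclusive, in token positions.
            entities.append({"start": start, "end": len(tokens), "type": etype})
            start = None
        else:
            tokens.append(piece)
    if start is not None:
        raise ValueError("<START> without a matching <END>")
    return {"tokens": tokens, "entities": entities}


def to_json(line):
    """Serialize one training line as a JSON string for the client."""
    return json.dumps(parse_name_sample(line))
```
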
> For the client side, Hannes started checking whether jQuery would make
> it easier to implement the AJAX callbacks based on mouse + keyboard
> interaction.
> As for the licensing, Hannes told me that his employer should be
> willing to license the relevant parts of Walter (those not specific to
> Fraunhofer) under a liberal license (MIT, BSD or ASL) so that it should be
> possible to contribute it to the ASF in the long term.
> Another group tested DUALIST: the tool looks really nice for the text
> classification case, less so for the NE detection case (the sample
> view is not very well suited for structured output and it requires
> building Hearst features by hand; DUALIST apparently does not do it
> automatically).
> It should be possible to turn the Walter refiner into a real active
> learning annotation tool for structured output (NE and relation
> extraction) if we use the confidence level of the SequentialPerceptron
> of OpenNLP and treat the least confident predictions as priority samples
> when ordering the samples to process in the refiner after the user
> presses "space" or "d". The server could incrementally use the refined
> samples to update its model and, from time to time, adjust the priority
> of the next batch of samples to refine, as the perceptron algorithm is
> online (supports partial update of the model without restarting from
> scratch).
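
The prioritization loop described above could look roughly like the sketch
below. It is a minimal illustration in Python, not OpenNLP code: confidence,
annotate and update are hypothetical callables standing in for the model's
confidence score, the human decision in the refiner, and the online model
update. Note the caution raised earlier in this thread that uncertainty
alone may not work well, so this ordering is a starting point only.

```python
def active_refinement(pool, confidence, annotate, update, batch_size=10):
    """Present samples least-confident first, re-ranking after each batch.

    confidence(sample) -> float   current model confidence (hypothetical)
    annotate(sample)   -> refined sample, or None if the user discards it
    update(refined)    -> None    online (partial) model update
    Returns the order in which samples were shown to the user.
    """
    remaining = list(pool)
    shown = []
    while remaining:
        # Re-rank with the current model so earlier updates affect priority.
        remaining.sort(key=confidence)
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for sample in batch:
            refined = annotate(sample)
            if refined is not None:
                update(refined)
            shown.append(sample)
    return shown
```

A smaller batch_size re-ranks more often, at the cost of scoring the
remaining pool more frequently.
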
> Another group worked on named entity disambiguation using Solr
> MoreLikeThisHandler and indexes of the contexts in which those
> entities occur in Wikipedia articles. This work will probably be
> integrated in Stanbol directly and should be less interesting for the
> OpenNLP project. Also another group worked on adapting pignlproc to
> their own tools and Hadoop infrastructure.
> Comments and pull-requests on the corpus-refiner prototype welcome. I
> plan to go on working on this project from time to time. AFAIK Hannes
> won't have time to work on the JS layer in the short term but it
> should be at least possible to have a first version of the command
> line based interface rather quickly.
> --
> Olivier

Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
