uima-user mailing list archives

From Mario Gazzo <mario.ga...@gmail.com>
Subject Re: Approach for keeping track of formatting associated with text views
Date Wed, 18 Feb 2015 22:03:36 GMT
Thanks. Looks interesting, seems that it could fit our use case. We will have a closer look
at it.

> On 18 Feb 2015, at 21:58 , Peter Klügl <pkluegl@uni-wuerzburg.de> wrote:
> Hi,
> you might want to take a look at two analysis engines of UIMA Ruta: HtmlAnnotator and
> HtmlConverter [1].
> The former creates annotations for HTML elements, and therefore also for XML tags.
> The latter creates a new view containing only the plain text and adds the existing
> annotations to it, adapting their offsets to the new document.
> Best,
> Peter
> [1] http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
> On 18.02.2015 at 21:46, Mario Gazzo wrote:
>> We are starting to use the UIMA framework for NL processing of article text, which is
>> usually stored with metadata in some XML format. We need to extract text elements to be
>> processed by various NL analysis engines that only work with pure text, but we also need
>> to keep track of the formatting information related to the processed text. In general it
>> is also valuable for us to be able to track every annotation back to the original XML to
>> maintain provenance. Before embarking on this I would like to validate our approach with
>> more experienced users, since this is the first application we are building with UIMA.
>> In the first step we would annotate every important element of the XML, including
>> formatting elements in the body. We would maintain some DOM-like relationships between
>> the body text and the formatting annotations so that text formatting can be reproduced
>> later with NLP annotations in some article viewer.
>> Next, in another AE, we would produce a pure text view of the text annotations in the
>> XML view that need to be NL analysed. In this new text view we would annotate the
>> different text elements with references back to their counterparts in the original XML
>> view, so that we can trace back positions in the original XML and the formatting
>> relations. This will of course require mapping NLP annotation offsets in the text view
>> back to the XML view, but the information should then be there to make this possible.
>> This approach requires somewhat more handcrafted bookkeeping than we initially hoped
>> would be necessary. We haven’t been able to find any examples of how this is usually
>> done, and the UIMA docs are vague regarding managing these kinds of relationships across
>> views. We would therefore really like to know if there is a simpler and better approach.
>> Any feedback is greatly appreciated. Thanks.
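
The offset bookkeeping described in the quoted message can be sketched independently of UIMA. The Python snippet below is only an illustration under simplifying assumptions, not how HtmlConverter or any UIMA component does it: the helper names `strip_tags` and `to_xml_span` are hypothetical, tags are detected by a naive scan for `<` and `>`, and entities, comments and CDATA sections are not handled. It builds the plain text together with a table that records, for each plain-text character, its position in the XML, and uses that table to map a span from the text view back to the XML.

```python
def strip_tags(xml):
    """Return (plain_text, offsets); offsets[i] is the index in xml of plain_text[i].

    Naive scan: everything between '<' and '>' is treated as markup.
    """
    text_chars = []
    offsets = []
    in_tag = False
    for i, ch in enumerate(xml):
        if ch == "<":
            in_tag = True
        elif ch == ">" and in_tag:
            in_tag = False
        elif not in_tag:
            text_chars.append(ch)
            offsets.append(i)
    return "".join(text_chars), offsets


def to_xml_span(offsets, begin, end):
    """Map a half-open [begin, end) span in the text view to a span in the XML."""
    if not 0 <= begin < end <= len(offsets):
        raise ValueError("span lies outside the text view")
    # The XML span ends one past the source position of the last covered character.
    return offsets[begin], offsets[end - 1] + 1


# Hypothetical sample document.
xml = "<p>Aspirin <b>reduces</b> fever.</p>"
text, offsets = strip_tags(xml)

# Pretend an NLP engine annotated the word "reduces" in the text view.
begin = text.index("reduces")
xml_begin, xml_end = to_xml_span(offsets, begin, begin + len("reduces"))
print(text)
print(xml[xml_begin:xml_end])
```

A text-view span that crosses a formatting boundary maps to an XML span that includes the intervening tags, which is one reason to keep the formatting annotations in the XML view rather than copying them into the text view.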
