uima-user mailing list archives

From Jens Grivolla <j+...@grivolla.net>
Subject Re: Approach for keeping track of formatting associated with text views
Date Mon, 09 Mar 2015 15:56:10 GMT
Hi Peter, while I don't think I will be using the HtmlConverter right away,
I would vote for using the length of the document annotation for
annotations that relate to the whole document (such as metadata).  That
makes them show up nicely in the CasEditor/Viewer and you could maintain it
in all segments when you split a CAS (e.g. with something based on the
SimpleTextSegmenter example).

-- Jens
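
The suggestion above can be illustrated with a minimal sketch in plain Python. It does not use the UIMA API, and the function name, the tuple layout and the fixed-size splitting are assumptions made only for this example. The idea shown is that a document-level annotation (such as metadata) spans the whole text, so when the text is split, each segment can carry its own copy spanning that whole segment.

```python
def split_with_document_annotations(text, doc_annotation_types, segment_size):
    """Split `text` into fixed-size segments (hypothetical helper).

    Every document-level annotation type is re-created in each segment
    with the span (0, len(segment)), i.e. the span a document annotation
    would have in that segment.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    segments = []
    for start in range(0, len(text), segment_size):
        segment = text[start:start + segment_size]
        # (type, begin, end) triples stand in for real annotations here.
        annotations = [(t, 0, len(segment)) for t in doc_annotation_types]
        segments.append((segment, annotations))
    return segments


segments = split_with_document_annotations("abcdefghij", ["Citation"], 4)
```

Because each copy spans its whole segment, none of them has length 0, which sidesteps the zero-length annotations discussed in the quoted message below.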

On Sat, Mar 7, 2015 at 5:33 PM, Peter Klügl <pkluegl@uni-wuerzburg.de>
wrote:

> Hi,
>
> there is no way yet to customize this behavior. The HtmlConverter only
> retains annotations of length > 0, since annotations with length == 0 are
> rather problematic and should be avoided.
>
> I can add a configuration parameter for keeping these annotations if you
> want (best open an issue for it). What should be the offsets of the
> annotations for elements in the head of the HTML document? 0, those of the
> first token, or those of the document annotation?
>
> Best,
>
> Peter
>
>
> On 06.03.2015 at 14:00, Mario Gazzo wrote:
>
>> We conducted some experiments with both the HtmlAnnotator and the
>> HtmlConverter, but we ran into an issue with the converter. It appears to
>> only convert tag annotations that surround or are inside the body tag.
>> Metadata elements like citations are ignored. The only way to get around
>> this seems to be forking and modifying the codebase, which I would like to
>> avoid. Both modules otherwise seem very useful to us, but I am looking for
>> a better approach to solve this issue. Is there some way to customise this
>> behaviour without code modifications?
>>
>> Your input is appreciated, thanks.
>>
>>
>> On 18 Feb 2015, at 23:03, Mario Gazzo <mario.gazzo@gmail.com> wrote:
>>>
>>> Thanks. Looks interesting, seems that it could fit our use case. We will
>>> have a closer look at it.
>>>
>>> On 18 Feb 2015, at 21:58, Peter Klügl <pkluegl@uni-wuerzburg.de> wrote:
>>>>
>>>> Hi,
>>>>
>>>> you might want to take a look at two analysis engines of UIMA Ruta:
>>>> HtmlAnnotator and HtmlConverter [1]
>>>>
>>>> The former creates annotations for HTML elements and therefore also
>>>> for XML tags. The latter creates a new view with only the plain text
>>>> and adds the existing annotations while adapting their offsets to the
>>>> new document.
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>> [1] http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
>>>>
>>>> On 18.02.2015 at 21:46, Mario Gazzo wrote:
>>>>
>>>>> We are starting to use the UIMA framework for NL processing of article
>>>>> text, which is usually stored with metadata in some XML format. We need
>>>>> to extract text elements to be processed by various NL analysis engines
>>>>> that only work with pure text, but we also need to keep track of the
>>>>> formatting information related to the processed text. It is in general
>>>>> also valuable for us to be able to track every annotation back to the
>>>>> original XML to maintain provenance. Before embarking on this I would
>>>>> like to validate our approach with more experienced users, since this is
>>>>> the first application we are building with UIMA.
>>>>>
>>>>> In the first step we would annotate every important element of the XML,
>>>>> including formatting elements in the body. We maintain some DOM-like
>>>>> relationships between the body text and formatting annotations so that
>>>>> text formatting can be reproduced later with NLP annotations in some
>>>>> article viewer.
>>>>> Next, in another AE, we would produce a pure text view of the text
>>>>> annotations in the XML view that need to be NL analysed. In this new
>>>>> text view we would annotate the different text elements with references
>>>>> back to their counterparts in the original XML view so that we can trace
>>>>> back positions in the original XML and the formatting relations. This of
>>>>> course will require mapping NLP annotation offsets in the text view back
>>>>> to the XML view, but the information should then be there to make this
>>>>> possible.
>>>>>
>>>>> This approach requires somewhat more handcrafted bookkeeping than we
>>>>> initially hoped would be necessary. We haven’t been able to find any
>>>>> examples of how this is usually done, and the UIMA docs are vague about
>>>>> managing these kinds of relationships across views. We would therefore
>>>>> really like to know if there is a simpler and better approach.
>>>>>
>>>>> Any feedback is greatly appreciated. Thanks.
>>>>>
>>>>
>
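
The offset bookkeeping described in the original question (a plain-text view whose annotations can be traced back to the markup) can be sketched roughly as follows. This is plain Python, not the UIMA or Ruta API and not how the HtmlConverter is implemented. The helper names and the regex-based tag stripping are assumptions for illustration, and the sketch ignores entities, comments and CDATA.

```python
import re

# Naive tag pattern; an assumption for this sketch, not a real XML parser.
TAG = re.compile(r"<[^>]*>")


def strip_tags_with_offsets(markup):
    """Return (plain_text, offsets).

    offsets[i] is the index in `markup` of the character that became
    plain_text[i], so every plain-text position can be traced back.
    """
    chars, offsets = [], []
    pos = 0
    for match in TAG.finditer(markup):
        for i in range(pos, match.start()):
            chars.append(markup[i])
            offsets.append(i)
        pos = match.end()
    for i in range(pos, len(markup)):
        chars.append(markup[i])
        offsets.append(i)
    return "".join(chars), offsets


def to_source_span(offsets, begin, end):
    """Map a non-empty [begin, end) plain-text span back to markup offsets."""
    if not 0 <= begin < end <= len(offsets):
        raise ValueError("span must be non-empty and inside the plain text")
    return offsets[begin], offsets[end - 1] + 1


markup = "<p>Hello <b>world</b>!</p>"
text, offsets = strip_tags_with_offsets(markup)
begin = text.index("world")
src_begin, src_end = to_source_span(offsets, begin, begin + len("world"))
```

Keeping one source offset per plain-text character costs memory but makes the reverse mapping a simple lookup. Zero-length spans are rejected here because they have no character to anchor to, which mirrors the concern about annotations of length 0 raised earlier in the thread.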
