uima-user mailing list archives

From Mario Gazzo <mario.ga...@gmail.com>
Subject Re: Approach for keeping track of formatting associated with text views
Date Tue, 10 Mar 2015 12:54:30 GMT
Thanks, I can of course open an issue for this.

I have been playing with a modified version of the HTMLConverter, which is why my reply is
delayed. I disabled the ‘inBody’ flag inside the HTMLConverterVisitor to get an idea of
what the effects might be. It pretty much did what I thought I wanted, except that there are
no clear sentence boundaries between many of the metadata strings. Most of them are not really
meaningful for NL processing, but there are a few we would want to analyse, and the sentence
separation between them is now gone. I have been looking at some of the conversion and line
break options to get around this but I haven't found a good approach yet. I really only want
to introduce some sentence separation like “. “ between the contents of different tags
outside the body.
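
To make the separator idea concrete, here is a minimal sketch in plain Java, independent of
the HtmlConverter. It joins metadata strings with “. “ and records where each string ends up
in the joined text, so that annotations can still be placed on the individual strings. The
class MetadataJoiner and all of its members are hypothetical names for illustration and are
not part of UIMA Ruta.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of UIMA Ruta: joins metadata strings into one
// text with sentence-like separators and remembers each string's offsets.
final class MetadataJoiner {

    /** Result of a join: the combined text plus one [begin, end) pair per kept segment. */
    static final class Joined {
        final String text;
        final List<int[]> spans;

        Joined(String text, List<int[]> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    static Joined join(List<String> segments) {
        StringBuilder sb = new StringBuilder();
        List<int[]> spans = new ArrayList<>();
        for (String raw : segments) {
            String segment = raw.trim();
            if (segment.isEmpty()) {
                continue; // skip empty strings, they would give zero-length spans
            }
            if (sb.length() > 0) {
                char last = sb.charAt(sb.length() - 1);
                boolean endsSentence = last == '.' || last == '!' || last == '?';
                // Avoid ".. " when the previous segment already ends a sentence.
                sb.append(endsSentence ? " " : ". ");
            }
            int begin = sb.length();
            sb.append(segment);
            spans.add(new int[] { begin, sb.length() });
        }
        return new Joined(sb.toString(), spans);
    }
}
```

For example, joining "Title of article" and "Journal of Examples." would give the text
"Title of article. Journal of Examples." with the second string starting at offset 18.
Whether something like this could be hooked into the converter without forking it is exactly
the part I have not figured out.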

I am not sure I understand your offset question. Would you mind elaborating on it? Our
documents are in XML with a single body element containing HTML.


> On 07 Mar 2015, at 17:33, Peter Klügl <pkluegl@uni-wuerzburg.de> wrote:
> 
> Hi,
> 
> there is no way yet to customize this behavior. The HtmlConverter only retains annotations
> with a length > 0, since annotations with length == 0 are rather problematic and should be
> avoided.
> 
> I can add a configuration parameter for keeping these annotations if you want (best open
> an issue for it). What should be the offsets of the annotations for elements in the head of
> the HTML document? 0, those of the first token, or those of the document annotation?
> 
> Best,
> 
> Peter
> 
> 
> On 06.03.2015 at 14:00, Mario Gazzo wrote:
>> We conducted some experiments with both the HtmlAnnotator and the HtmlConverter but
>> we ran into an issue with the converter. It appears to only convert tag annotations that
>> surround or are inside the body tag. Metadata elements like citations are ignored. The only
>> way to get around this seems to be forking and modifying the codebase, which I would like
>> to avoid. Both modules otherwise seem very useful to us, but I am looking for a better
>> approach to solve this issue. Is there some way to customise this behaviour without code
>> modifications?
>> 
>> Your input is appreciated, thanks.
>> 
>> 
>>> On 18 Feb 2015, at 23:03, Mario Gazzo <mario.gazzo@gmail.com> wrote:
>>> 
>>> Thanks. Looks interesting, seems that it could fit our use case. We will have
>>> a closer look at it.
>>> 
>>>> On 18 Feb 2015, at 21:58, Peter Klügl <pkluegl@uni-wuerzburg.de> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> you might want to take a look at two analysis engines of UIMA Ruta: HtmlAnnotator
>>>> and HtmlConverter [1]
>>>> 
>>>> The former creates annotations for HTML elements and therefore also for XML tags.
>>>> The latter creates a new view containing only the plain text and adds the existing
>>>> annotations while adapting their offsets to the new document.
>>>> 
>>>> Best,
>>>> 
>>>> Peter
>>>> 
>>>> [1] http://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.ae.html
>>>> 
>>>> On 18.02.2015 at 21:46, Mario Gazzo wrote:
>>>>> We are starting to use the UIMA framework for NL processing of article text,
>>>>> which is usually stored with metadata in some XML format. We need to extract text
>>>>> elements to be processed by various NL analysis engines that only work with pure text,
>>>>> but we also need to keep track of the formatting information related to the processed
>>>>> text. In general, it is also valuable for us to be able to track every annotation back
>>>>> to the original XML to maintain provenance. Before embarking on this I would like to
>>>>> validate our approach with more experienced users, since this is the first application
>>>>> we are building with UIMA.
>>>>> 
>>>>> In the first step we would annotate every important element of the XML,
>>>>> including formatting elements in the body. We maintain some DOM-like relationships
>>>>> between the body text and formatting annotations so that text formatting can be
>>>>> reproduced later with NLP annotations in some article viewer.
>>>>> 
>>>>> Next, in another AE, we would produce a pure text view of the text annotations
>>>>> in the XML view that need to be NL analysed. In this new text view we would annotate
>>>>> the different text elements with references back to their counterparts in the original
>>>>> XML view, so that we can trace back positions in the original XML and the formatting
>>>>> relations. This will of course require mapping NLP annotation offsets in the text view
>>>>> back to the XML view, but the information should then be there to make this possible.
>>>>> 
>>>>> This approach requires somewhat more handcrafted bookkeeping than we
>>>>> initially hoped would be necessary. We haven’t been able to find any examples of how
>>>>> this is usually done, and the UIMA docs are vague about managing this kind of
>>>>> relationship across views. We would therefore really like to know if there is a simpler
>>>>> and better approach.
>>>>> 
>>>>> Any feedback is greatly appreciated. Thanks.
> 
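
The offset bookkeeping described in the first message of this thread (a plain-text view
whose positions can be traced back to the XML view) can be sketched in plain Java as below.
This is only an illustration of the mapping idea under simplifying assumptions: everything
between '<' and '>' is treated as markup, and entities, comments and CDATA are not handled.
It is not how the HtmlConverter is implemented, and OffsetMap, strip and toSource are
hypothetical names.

```java
import java.util.Arrays;

// Hypothetical sketch: strips tags from a markup string and keeps, for every
// character of the plain text, the offset it had in the original markup.
final class OffsetMap {
    final String text;            // plain text with markup removed
    private final int[] source;   // source[i] = markup offset of text.charAt(i)

    private OffsetMap(String text, int[] source) {
        this.text = text;
        this.source = source;
    }

    static OffsetMap strip(String markup) {
        StringBuilder sb = new StringBuilder();
        int[] source = new int[markup.length()];
        boolean inTag = false;
        for (int i = 0; i < markup.length(); i++) {
            char c = markup.charAt(i);
            if (c == '<') {
                inTag = true;                 // tag starts, drop characters
            } else if (c == '>' && inTag) {
                inTag = false;                // tag ends
            } else if (!inTag) {
                source[sb.length()] = i;      // remember where this character came from
                sb.append(c);
            }
        }
        return new OffsetMap(sb.toString(), Arrays.copyOf(source, sb.length()));
    }

    /** Maps a non-empty half-open span [begin, end) of the plain text to a span of the markup. */
    int[] toSource(int begin, int end) {
        if (begin < 0 || end > text.length() || begin >= end) {
            throw new IllegalArgumentException("span must be non-empty and inside the text");
        }
        return new int[] { source[begin], source[end - 1] + 1 };
    }
}
```

With the markup `<p>Hello <b>world</b>!</p>`, the plain text is "Hello world!", and the span
of "world" (6 to 11 in the text) maps back to 12 to 17 in the markup. Empty spans are rejected
on purpose, in line with the remark above that zero-length annotations are problematic. Note
that a mapped span that crosses a tag boundary will include the tags in between.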

