uima-user mailing list archives

From Mario Gazzo <mario.ga...@gmail.com>
Subject Re: UIMA Ruta not capturing some XML markup with attributes?
Date Wed, 21 Oct 2015 15:24:41 GMT
Thanks Peter,

The quotes are plain ASCII quotes in the original source; the mail software must have
changed them. Sorry about the misunderstanding.

Cheers
Mario 

> On 21/10/2015, at 16.03, Peter Klügl <peter.kluegl@averbis.com> wrote:
> 
> Hi,
> 
> I extended the pattern to support dashes, but not the other quotes. This
> can get arbitrarily complex (and slow) if every combination of Unicode
> characters that look like quotes has to be supported. I still think that
> this is not valid XML. Can you give me a link to the standard?
> 
> It may be better to solve this within the specific use case before applying
> the seeder.
> 
> Best,
> 
> Peter
> 
>> On 20.10.2015 at 19:22, Mario Gazzo wrote:
>> I believe it should be extended, since I think a RUTA user would expect
>> the MARKUP annotation to capture at least XML and HTML markup properly.
>> The examples are from a PubMed Central XML file that follows the NISO JATS
>> specification, so I will assume it is properly formatted XML without
>> knowing all the details of the spec.
>> 
>> We have managed to implement a crude workaround for now, but let us know
>> when an improved version becomes available.
>> 
>> Cheers
>> Mario
>> 
>>> On 20 Oct 2015, at 17:56 , Peter Klügl <peter.kluegl@averbis.com> wrote:
>>> 
>>> Hi Mario,
>>> 
>>> yes, and the different quote also causes problems (are these valid?).
>>> 
>>> The MARKUP annotation is not created by jflex like the other annotations,
>>> but by a postprocessing step using a regular expression. This expression
>>> does not cover these cases (markupPattern in DefaultSeeder.java).
>>> 
>>> Should we extend it?
>>> 
>>> Best,
>>> 
>>> Peter
>>> 
>>>> On 20.10.2015 at 17:26, Mario Gazzo wrote:
>>>> Hi Peter,
>>>> 
>>>> RUTA doesn’t seem to capture some XML markup with attributes. Here are
>>>> some examples:
>>>> 
>>>> <xref ref-type="bibr" rid="b35-ehp0113-000220”>
>>>> <sec sec-type="methods”>
>>>> 
>>>> The above markup examples are completely missing from the TokenSeed
>>>> annotations. I wonder whether this is related to the dash in the attribute
>>>> names, since other markup without dashes appears to be captured.
>>>> 
>>>> Can you confirm that the dash could cause the problem?
>>>> 
>>>> Cheers
>>>> Mario
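
The dash problem discussed in this thread can be illustrated with a minimal Java sketch. Both patterns below are hypothetical and written only for illustration; they are not the actual markupPattern from DefaultSeeder.java, and the class and method names are assumptions. The narrow pattern allows only word characters in attribute names, so it skips a tag such as <sec sec-type="methods">, while the extended pattern also accepts dashes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: these are NOT the patterns used by UIMA Ruta's DefaultSeeder.
class MarkupPatternSketch {

    // Hypothetical narrow pattern: tag and attribute names are word characters only.
    static final Pattern NARROW = Pattern.compile(
            "</?\\w+(\\s+\\w+\\s*=\\s*\"[^\"]*\")*\\s*/?>");

    // Hypothetical extended pattern: names may also contain dashes.
    static final Pattern EXTENDED = Pattern.compile(
            "</?[\\w\\-]+(\\s+[\\w\\-]+\\s*=\\s*\"[^\"]*\")*\\s*/?>");

    // Collect every substring of the text that the given pattern matches.
    static List<String> findMarkup(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "<sec sec-type=\"methods\"><p id=\"p1\">Text</p></sec>";
        System.out.println("narrow:   " + findMarkup(NARROW, text));
        System.out.println("extended: " + findMarkup(EXTENDED, text));
    }
}
```

With the narrow pattern the opening sec tag is missed while the other tags are still found, which matches the behaviour reported above. Neither pattern accepts a typographic closing quote, so a tag whose attribute value ends in ” is still not matched.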
> 
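The suggestion to handle unusual quotes in the specific use case, before the seeder runs, could look roughly like the sketch below. It is an assumption about one possible preprocessing step, not part of UIMA Ruta, and the class and method names are hypothetical. It replaces the typographic double quotes U+201C and U+201D with the ASCII double quote. Each replacement swaps one character for one character, so the text length and all character offsets stay the same.

```java
// Illustrative preprocessing sketch; not part of UIMA Ruta.
class QuoteNormalizerSketch {

    // Replace left and right typographic double quotes with the ASCII double quote.
    // The replacement is one-for-one, so character offsets are preserved.
    static String normalizeDoubleQuotes(String text) {
        return text.replace('\u201C', '"').replace('\u201D', '"');
    }

    public static void main(String[] args) {
        String damaged = "<sec sec-type=\"methods\u201D>";
        System.out.println(normalizeDoubleQuotes(damaged));
    }
}
```

Note that this also changes typographic quotes in the running text, not only inside tags, so it is only suitable when that side effect is acceptable for the use case.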
