uima-user mailing list archives

From Peter Klügl <peter.klu...@averbis.com>
Subject Re: UIMA Ruta not capturing some XML markup with attributes?
Date Wed, 21 Oct 2015 14:03:40 GMT
Hi,

I extended the pattern to support dashes, but not the other quotes. This
can get arbitrarily complex (and slow) if every combination of Unicode
characters that look like quotes has to be supported. I still think that
this is not valid XML. Can you give me a link to the standard?
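
To illustrate what "supporting dashes" means for such an expression, here
is a minimal sketch of a tag pattern whose attribute names may contain
dashes. It is not the actual markupPattern from DefaultSeeder.java; the
class name MarkupPatternSketch, the method isMarkup and the expression
itself are assumptions made up for this example. It accepts only straight
single or double quotes around attribute values.

```java
import java.util.regex.Pattern;

class MarkupPatternSketch {

    // Hypothetical expression, NOT the markupPattern from DefaultSeeder.java.
    // Element and attribute names may contain word characters, ':', '.' and '-'.
    // Attribute values must be wrapped in straight double or single quotes.
    static final Pattern MARKUP = Pattern.compile(
        "</?[A-Za-z][\\w:.-]*"                       // "<" or "</" followed by the element name
        + "(\\s+[A-Za-z_:][\\w:.-]*"                 // attribute name, dashes allowed
        + "\\s*=\\s*(\"[^\"]*\"|'[^']*'))*"          // "=" and a quoted attribute value
        + "\\s*/?>");                                // optional "/" and the closing ">"

    // Returns true if the whole string is a single tag according to the sketch.
    static boolean isMarkup(String s) {
        return MARKUP.matcher(s).matches();
    }
}
```

With this sketch, a tag such as <xref ref-type="bibr" rid="b35-ehp0113-000220">
is accepted, while the same tag closed with a typographic quote is rejected,
which is the remaining problem discussed here.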

It may be better to handle this in the specific use case, before the
seeder is applied.
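
One way to do that is sketched below, under assumptions: a plain Java
preprocessing step that replaces typographic quotes with straight quotes,
but only inside tag-like spans, before the text is handed to the pipeline.
The class QuoteNormalizer and its method are hypothetical and not part of
Ruta. Each replacement swaps one character for one character, so the
offsets of the text do not change.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteNormalizer {

    // Assumption: everything from '<' up to the next '>' is a tag candidate.
    private static final Pattern TAG = Pattern.compile("<[^<>]*>");

    // Replaces typographic quotes inside tag candidates with straight quotes.
    // Text outside of tags is left untouched, and the length stays the same.
    static String normalizeQuotesInTags(String text) {
        StringBuilder sb = new StringBuilder(text);
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            for (int i = m.start(); i < m.end(); i++) {
                char c = sb.charAt(i);
                if (c == '\u201C' || c == '\u201D') {
                    // left or right double quotation mark
                    sb.setCharAt(i, '"');
                } else if (c == '\u2018' || c == '\u2019') {
                    // left or right single quotation mark
                    sb.setCharAt(i, '\'');
                }
            }
        }
        return sb.toString();
    }
}
```

Calling QuoteNormalizer.normalizeQuotesInTags(documentText) before setting
the document text would leave quotes in the running text as they are and
only touch the ones inside angle brackets.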

Best,

Peter

Am 20.10.2015 um 19:22 schrieb Mario Gazzo:
> I believe it should be extended, since I think a RUTA user would expect the
> MARKUP annotation to capture at least XML and HTML markup properly. The examples
> are from a PubMed Central XML file that follows the NISO JATS specification, so
> I assume it is properly formatted XML without knowing all the details of the spec.
>
> We have managed to implement a crude workaround for now but let us know when an
> improved version becomes available.
>
> Cheers
> Mario
>
>> On 20 Oct 2015, at 17:56 , Peter Klügl <peter.kluegl@averbis.com> wrote:
>>
>> Hi Mario,
>>
>> yes, and the different quote also causes problems (are these valid?).
>>
>> The MARKUP annotation is not created by jflex like the other annotations,
>> but by a postprocessing step using a regular expression. This expression
>> does not cover these cases (markupPattern in DefaultSeeder.java).
>>
>> Should we extend it?
>>
>> Best,
>>
>> Peter
>>
>> Am 20.10.2015 um 17:26 schrieb Mario Gazzo:
>>> Hi Peter,
>>>
>>> RUTA doesn’t seem to capture some XML markup with attributes. Here are some
>>> examples:
>>>
>>> <xref ref-type="bibr" rid="b35-ehp0113-000220”>
>>> <sec sec-type="methods”>
>>>
>>> The above markup examples are totally missing in the TokenSeed annotations. I
>>> wonder whether it is related to the dash in the attribute names, since other
>>> markup without dashes appears to be captured.
>>>
>>> Can you confirm that the dash could cause the problem?
>>>
>>> Cheers
>>> Mario

