uima-user mailing list archives

From Manuel Ciosici <manuel.cios...@unsilo.com>
Subject Re: UIMA Ruta not capturing some XML markup with attributes?
Date Thu, 22 Oct 2015 10:43:57 GMT
Hello Peter,
I looked a bit at the new regular expression, and there are still some cases that aren’t caught.
More specifically, XML tags that have a dash in their name aren’t caught by the current
regular expression. I’ve changed the expression so that it
works. What I did was change the \w+ part for the tag name into \w[\w-]*, since XML tag names
can contain dashes but cannot start with one. I’ve also updated the unit test so that
it includes tags with dashes and underscores, and also one non-tag.
I’m attaching the SVN patch to this email.
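
For readers without the patch, here is a minimal sketch of the change described above. The two patterns are simplified stand-ins written for illustration; they are not the actual markupPattern from DefaultSeeder.java, and the class and method names are hypothetical. The sketch only shows why \w+ stops at the first dash while \w[\w-]* does not.

```java
import java.util.regex.Pattern;

public class TagNameSketch {
    // Hypothetical, simplified patterns; NOT the real markupPattern in DefaultSeeder.java.
    // Old form: the tag name is \w+, so a dash ends the name too early.
    static final Pattern OLD_TAG = Pattern.compile("</?\\w+(\\s[^>]*)?/?>");
    // New form: the name starts with a word character and may then contain dashes.
    // Note that \w also admits a leading digit, which is looser than XML allows.
    static final Pattern NEW_TAG = Pattern.compile("</?\\w[\\w-]*(\\s[^>]*)?/?>");

    static boolean matchesOld(String s) {
        return OLD_TAG.matcher(s).matches();
    }

    static boolean matchesNew(String s) {
        return NEW_TAG.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Made-up sample tags: plain, dashed, underscored, and one non-tag.
        String[] samples = {"<sec>", "<ref-list>", "<named_content>", "<-oops>"};
        for (String s : samples) {
            System.out.println(s + " old=" + matchesOld(s) + " new=" + matchesNew(s));
        }
    }
}
```

Under these assumptions, only the dashed tag name changes behaviour between the two patterns; the string that starts with a dash is rejected by both.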

>Thanks Peter, 
>The quotes are just normal quotes in the original source, but the mail software must have
>changed this. Sorry about the misunderstanding.
>> On 21/10/2015, at 16.03, Peter Klügl <peter.kluegl@averbis.com> wrote: 
>> Hi, 
>> I extended the pattern to support dashes, but not the other quotes. This
>> can get arbitrarily complex (and slow) if any combination of Unicode
>> characters that look like quotes should be supported. I still think that
>> this is not valid XML. Can you give me a link to the standard?
>> It may be better to solve this for the specific use case before applying
>> the seeder.
>> Best, 
>> Peter 
>>> On 20.10.2015 at 19:22, Mario Gazzo wrote:
>>> I believe it should be extended, since I think that a RUTA user would expect that
>>> the MARKUP annotation indeed captures at least XML and HTML markup properly. The examples
>>> are from a PubMed Central XML file that follows the NISO JATS specification, so I will
>>> assume it is properly formatted XML without knowing all the details of the spec.
>>> We have managed to implement a crude workaround for now, but let us know when
>>> an improved version becomes available.
>>> Cheers
>>> Mario
>>>> On 20 Oct 2015, at 17:56, Peter Klügl <peter.kluegl@averbis.com> wrote:
>>>> Hi Mario,
>>>> yes, and the different quote also causes problems (are these valid?).
>>>> The MARKUP annotation is not created by JFlex like the other annotations,
>>>> but by a postprocessing step using a regular expression. This expression
>>>> does not cover these cases (markupPattern in DefaultSeeder.java).
>>>> Should we extend it?
>>>> Best, 
>>>> Peter 
>>>>> On 20.10.2015 at 17:26, Mario Gazzo wrote:
>>>>> Hi Peter,
>>>>> RUTA doesn’t seem to capture some XML markup with attributes. Here
>>>>> are some examples:
>>>>> <xref ref-type="bibr" rid="b35-ehp0113-000220”> 
>>>>> <sec sec-type="methods”> 
>>>>> The above markup examples are totally missing in the TokenSeed annotations.
>>>>> I wonder whether it is related to the dash in the attribute names, since other markup without
>>>>> a dash appears to be captured.
>>>>> Can you confirm that the dash could cause the problem? 
>>>>> Cheers 
>>>>> Mario 
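
The two causes discussed in this thread (dashes in attribute names, and a curly closing quote) can be told apart with a small sketch. The pattern below is a simplified, hypothetical stand-in and not the markupPattern from DefaultSeeder.java; the class and method names are made up. It assumes attribute values are delimited by straight double quotes only, which is one of the two delimiters the XML 1.0 grammar permits (the other being the straight single quote).

```java
import java.util.regex.Pattern;

public class AttributeSketch {
    // Hypothetical simplified pattern, not the one used by the Ruta seeder.
    // Tag and attribute names may contain dashes; values need straight double quotes.
    static final Pattern TAG = Pattern.compile(
            "<\\w[\\w-]*(\\s+\\w[\\w-]*=\"[^\"]*\")*\\s*/?>");

    static boolean isMarkup(String s) {
        return TAG.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Same tag twice: once with straight quotes, once with the curly
        // closing quote (U+201D) as it appears in the mail above.
        String straight = "<sec sec-type=\"methods\">";
        String curly = "<sec sec-type=\"methods\u201D>";
        System.out.println("straight quotes: " + isMarkup(straight));
        System.out.println("curly closing quote: " + isMarkup(curly));
    }
}
```

With this sketch the dashed attribute name is accepted once dashes are allowed, while the variant with the curly closing quote is still rejected, because the attribute value never meets a closing straight quote.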