uima-user mailing list archives

From Peter Klügl <peter.klu...@averbis.com>
Subject Re: UIMA Ruta not capturing some XML markup with attributes?
Date Thu, 22 Oct 2015 10:50:02 GMT
Hi Manuel,

oh yes, forgot about the element name. Thank you for the patch, I will
integrate it.
The common procedure would be to attach the patch to a jira issue. I
will take care of it, but you are of course also welcome to attach it :-)

Best,

Peter

On 22.10.2015 at 12:43, Manuel Ciosici wrote:
> Hello Peter,
> I looked a bit at the new regular expression and there are still some
> cases that aren’t caught. More specifically, it won’t annotate XML
> tags that have a dash in their name, so tags such as:
> <first-name>
> aren’t caught by the current regular expression. I’ve changed the
> expression so that it works. What I did was change the \w+ part from
> the tag name into \w[\w-]* since XML tag names can contain dashes, but
> cannot start with dashes. I’ve also updated the unit test so that
> there are tags with dashes and underscores and also one non-tag.
> I’m attaching the SVN patch to this email.
> Manuel
>
>
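The change Manuel describes (tag name from \w+ to \w[\w-]*) can be illustrated with a minimal sketch. The two patterns below are simplified, hypothetical stand-ins written for this illustration; they are not the actual markupPattern from DefaultSeeder.java, and the class and method names are assumptions as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagNameDemo {
    // Hypothetical, simplified tag patterns (NOT the real markupPattern).
    // Old form: the tag name is \w+, so a dash ends the name too early.
    static final Pattern OLD_TAG =
            Pattern.compile("</?\\w+(\\s[^<>]*)?/?>");
    // New form: the name starts with a word character and may then
    // contain word characters or dashes.
    static final Pattern NEW_TAG =
            Pattern.compile("</?\\w[\\w-]*(\\s[^<>]*)?/?>");

    // Collects every substring of text that the pattern matches.
    static List<String> findTags(Pattern pattern, String text) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            tags.add(matcher.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String text = "<first-name>Ann</first-name> "
                + "<last_name>Lee</last_name> <-bad>";
        System.out.println("old: " + findTags(OLD_TAG, text));
        System.out.println("new: " + findTags(NEW_TAG, text));
    }
}
```

With the old name part, <first-name> is skipped because the dash is neither a word character nor the start of an attribute; the new name part accepts it, while <-bad> is rejected by both. Note that in Java \w matches [a-zA-Z_0-9] by default, so this sketch still accepts names that start with a digit and rejects names containing dots, colons, or non-ASCII letters, which XML permits.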
>> Thanks Peter,
>>
>> The quotes are just normal quotes in the original source but the
>> mail software must have changed this. Sorry about that
>> misunderstanding.
>>
>> Cheers
>> Mario
>>
>>> On 21/10/2015, at 16.03, Peter Klügl <peter.kluegl@averbis.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> I extended the pattern to support dashes, but not the other quotes.
>>> This can get arbitrarily complex (and slow) if any combination of
>>> unicode characters that look like quotes should be supported. I
>>> still think that this is not valid xml. Can you give me a link to
>>> the standard?
>>>
>>> It's maybe better to solve this in a specific use case before
>>> applying the seeder.
>>>
>>> Best,
>>>
>>> Peter
>>>
>>>> On 20.10.2015 at 19:22, Mario Gazzo wrote:
>>>>
>>>> I believe it should be extended since I think that a RUTA user
>>>> would expect that the MARKUP annotation indeed captures at least
>>>> XML and HTML markup properly. The examples are from a Pub Med
>>>> Central XML file that follows the NISO JATS specification so I
>>>> will assume it is properly formatted XML without knowing all the
>>>> details of the spec.
>>>>
>>>> We have managed to implement a crude workaround for now but let us
>>>> know when an improved version becomes available.
>>>>
>>>> Cheers
>>>> Mario
>>>>
>>>>> On 20 Oct 2015, at 17:56, Peter Klügl <peter.kluegl@averbis.com>
>>>>> wrote:
>>>>>
>>>>> Hi Mario,
>>>>>
>>>>> yes, and the different quote also causes problems (are these
>>>>> valid?).
>>>>>
>>>>> The MARKUP annotation is not created by jflex like the other
>>>>> annotations, but by a postprocessing step using a regular
>>>>> expression. This expression does not cover these cases
>>>>> (markupPattern in DefaultSeeder.java).
>>>>>
>>>>> Should we extend it?
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>>> On 20.10.2015 at 17:26, Mario Gazzo wrote:
>>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> RUTA doesn’t seem to capture some XML markup with attributes.
>>>>>> Here are some examples:
>>>>>>
>>>>>> <xref ref-type="bibr" rid="b35-ehp0113-000220”>
>>>>>> <sec sec-type="methods”>
>>>>>>
>>>>>> The above markup examples are totally missing in the TokenSeed
>>>>>> annotations. I wonder whether it is related to the dash in the
>>>>>> attribute names since other markup without this appear to be
>>>>>> captured.
>>>>>>
>>>>>> Can you confirm that the dash could cause the problem?
>>>>>>
>>>>>> Cheers
>>>>>> Mario

