Return-Path: X-Original-To: apmail-uima-user-archive@www.apache.org Delivered-To: apmail-uima-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 99D5F18847 for ; Thu, 22 Oct 2015 10:44:10 +0000 (UTC) Received: (qmail 30232 invoked by uid 500); 22 Oct 2015 10:44:10 -0000 Delivered-To: apmail-uima-user-archive@uima.apache.org Received: (qmail 30191 invoked by uid 500); 22 Oct 2015 10:44:10 -0000 Mailing-List: contact user-help@uima.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@uima.apache.org Delivered-To: mailing list user@uima.apache.org Received: (qmail 30178 invoked by uid 99); 22 Oct 2015 10:44:10 -0000 Received: from Unknown (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 22 Oct 2015 10:44:10 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 9CFF11A0875 for ; Thu, 22 Oct 2015 10:44:09 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.88 X-Spam-Level: ** X-Spam-Status: No, score=2.88 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=3, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd2-us-west.apache.org (amavisd-new); dkim=pass (1024-bit key) header.d=unsilo.com Received: from mx1-us-east.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id G1s6-IZmt3vB for ; Thu, 22 Oct 2015 10:44:01 +0000 (UTC) Received: from mail-lf0-f53.google.com (mail-lf0-f53.google.com [209.85.215.53]) by mx1-us-east.apache.org (ASF Mail Server at mx1-us-east.apache.org) with ESMTPS id ECA3942B35 for ; Thu, 22 Oct 2015 10:44:00 +0000 (UTC) 
Received: by lfaz124 with SMTP id z124so42866317lfa.1 for ; Thu, 22 Oct 2015 03:43:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=unsilo.com; s=unsilo; h=from:content-type:subject:message-id:date:to:mime-version; bh=boszfba7tMMQLgrpNEfzfVChufvfZXRac9nwfNpUaCU=; b=bLUP26isxvhhZ7yZXk/KUlEwoNKlGv6ZY/KDWe6D4zwo5dFllGY+9tn1WW2Ovpj87Q WTP/LbzICrAUwXEFTIa3mWpGxzaGXESElWGPnHpcaE7NaSeyKAWG7EJ+xRE0jSARlgfe aJKo80I4McOkOAehzBsu1JEHEKLNNlqXJZ+38= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:from:content-type:subject:message-id:date:to :mime-version; bh=boszfba7tMMQLgrpNEfzfVChufvfZXRac9nwfNpUaCU=; b=BFWDwmOoNNPBFx+dA8VWiYzzrIUNkvfw+JeUdMx74GQpdi2E3aPH5U2omA3v5C6ncC 2p9BCW4zkKBFeKykcqE/BYoDg5T8dKWM1ZaneNMv3lRoC6/cRTh+G3siPlwaRIfjpL5c JIZNWcJjVCN0mROgc3fSyzEit5Q41VeW2jFf4EChRJCSREKOApuDbtubeuz0EbynYMQ4 mKPquWmTMjH7KUKoD52IE+UD+3st4tMQP2V+epLOwdtNz9sHMlsuOZyQblNwdAxj268O wp1IbdrEKhKqMsscrpG5G/SJEG6QYR/8w156u+Assk9xClSOweLesGi4FxJYkPFsGIgl HcSw== X-Gm-Message-State: ALoCoQnUtIyihiFZb4b+7o5iqyuWQ6NklNxC9enWHoJGiVBNDe/qU3iUJS7+ytqYePpIys1dRuZF X-Received: by 10.112.61.226 with SMTP id t2mr7867095lbr.11.1445510639561; Thu, 22 Oct 2015 03:43:59 -0700 (PDT) Received: from [10.0.0.34] ([87.104.236.202]) by smtp.gmail.com with ESMTPSA id l82sm2282041lfg.0.2015.10.22.03.43.58 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Thu, 22 Oct 2015 03:43:58 -0700 (PDT) From: Manuel Ciosici Content-Type: multipart/alternative; boundary="Apple-Mail=_8D39D8D4-D3FE-4B31-9C78-A64D7A574C29" Subject: Re: UIMA Ruta not capturing some XML markup with attributes? 
Message-Id: <7399D83B-9081-47A3-B4C2-CAEAC0603927@unsilo.com>
Date: Thu, 22 Oct 2015 12:43:57 +0200
To: user@uima.apache.org
Mime-Version: 1.0 (Mac OS X Mail 9.0 \(3094\))
X-Mailer: Apple Mail (2.3094)

--Apple-Mail=_8D39D8D4-D3FE-4B31-9C78-A64D7A574C29
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=utf-8

Hello Peter,

I looked a bit at the new regular expression and there are still some
cases that aren’t caught. More specifically, it won’t annotate XML tags
that have a dash in their name, so tags such as:

<first-name>

aren’t caught by the current regular expression. I’ve changed the
expression so that it works. What I did was change the \w+ part that
matches the tag name into \w[\w-]*, since XML tag names can contain
dashes but cannot start with one. I’ve also updated the unit test so
that there are tags with dashes and underscores and also one non-tag.

I’m attaching the SVN patch to this email.

Manuel

>Thanks Peter,
>
>The quotes are just normal quotes in the original source but the mail software must have changed
>this. Sorry about that misunderstanding.
>
>Cheers
>Mario
>
>> On 21/10/2015, at 16.03, Peter Klügl wrote:
>>
>> Hi,
>>
>> I extended the pattern to support dashes, but not the other quotes. This
>> can get arbitrarily complex (and slow) if any combination of Unicode
>> characters that look like quotes should be supported. I still think that
>> this is not valid XML. Can you give me a link to the standard?
>>
>> It's maybe better to solve this in a specific use case before applying
>> the seeder.
>>
>> Best,
>>
>> Peter
>>
>>> On 20.10.2015 at 19:22, Mario Gazzo wrote:
>>> I believe it should be extended since I think that a RUTA user would expect that
>>> the MARKUP annotation indeed captures at least XML and HTML markup properly. The examples
>>> are from a PubMed Central XML file that follows the NISO JATS specification so I will assume
>>> it is properly formatted XML without knowing all the details of the spec.
>>>
>>> We have managed to implement a crude workaround for now but let us know when an improved
>>> version becomes available.
>>>
>>> Cheers
>>> Mario
>>>
>>>> On 20 Oct 2015, at 17:56, Peter Klügl wrote:
>>>>
>>>> Hi Mario,
>>>>
>>>> yes, and the different quote also causes problems (are these valid?).
>>>>
>>>> The MARKUP annotation is not created by JFlex like the other annotations,
>>>> but by a postprocessing step using a regular expression. This expression
>>>> does not cover these cases (markupPattern in DefaultSeeder.java).
>>>>
>>>> Should we extend it?
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>>> On 20.10.2015 at 17:26, Mario Gazzo wrote:
>>>>> Hi Peter,
>>>>>
>>>>> RUTA doesn’t seem to capture some XML markup with attributes. Here are
>>>>> some examples:
>>>>>
>>>>> <xref ref-type="bibr" rid="b35-ehp0113-000220”>
>>>>> <sec sec-type="methods”>
>>>>>
>>>>> The above markup examples are totally missing in the TokenSeed annotations.
>>>>> I wonder whether it is related to the dash in the attribute names since other markup without
>>>>> this appears to be captured.
>>>>>
>>>>> Can you confirm that the dash could cause the problem?
>>>>>
>>>>> Cheers
>>>>> Mario

--Apple-Mail=_8D39D8D4-D3FE-4B31-9C78-A64D7A574C29
Content-Type: multipart/mixed; boundary="Apple-Mail=_1CBDD299-FE2D-4184-A179-2AB34BB838C6"

--Apple-Mail=_1CBDD299-FE2D-4184-A179-2AB34BB838C6
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=utf-8

Hello Peter,
I looked a bit at the new regular expression and
there are still some cases that aren’t caught. More
specifically, it won’t annotate XML tags that have a dash in
their name, so tags such as:
<first-name>
aren’t caught by the current regular expression.
I’ve changed the expression so that it works. What I did was
change the \w+ part that matches the tag name into \w[\w-]*, since XML
tag names can contain dashes but cannot start with one. I’ve also
updated the unit test so that there are tags with dashes and underscores
and also one non-tag.
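To show the tag-name change in isolation, here is a minimal sketch. It does not reproduce markupPattern from DefaultSeeder.java: the class name TagNameDemo, the two simplified patterns, and the sample string are assumptions made up for this illustration, and only the \w+ versus \w[\w-]* part mirrors the change described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagNameDemo {

  // Hypothetical, simplified tag patterns (NOT the real markupPattern).
  // Old variant: the tag name is \w+, so a dash ends the name too early.
  static final Pattern OLD_TAG = Pattern.compile("</?\\w+(\\s[^<>]*)?/?>");
  // New variant: first character is \w, the rest may be \w or a dash.
  static final Pattern NEW_TAG = Pattern.compile("</?\\w[\\w-]*(\\s[^<>]*)?/?>");

  // Collects every substring of text that the given pattern matches.
  static List<String> findTags(Pattern pattern, String text) {
    List<String> result = new ArrayList<>();
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
      result.add(matcher.group());
    }
    return result;
  }

  public static void main(String[] args) {
    // Made-up sample: dashed tags, an underscore tag, and one non-tag.
    String doc = "<first-name>Ada</first-name><last_name/>"
        + "<-not-a-real-tag value=\"1\">";
    System.out.println(findTags(OLD_TAG, doc)); // [<last_name/>]
    System.out.println(findTags(NEW_TAG, doc)); // [<first-name>, </first-name>, <last_name/>]
  }
}
```

With these simplified patterns the dashed tags are only found by the second variant, while the string that starts with a dash is rejected by both.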
I’m attaching the SVN patch to this email.
Manuel

--Apple-Mail=_1CBDD299-FE2D-4184-A179-2AB34BB838C6
Content-Disposition: attachment;
	filename=MARKUP.patch
Content-Type: application/octet-stream;
	name="MARKUP.patch"
Content-Transfer-Encoding: quoted-printable

Index: ruta-core/src/main/java/org/apache/uima/ruta/seed/DefaultSeeder.java
===================================================================
--- ruta-core/src/main/java/org/apache/uima/ruta/seed/DefaultSeeder.java	(revision 1709972)
+++ ruta-core/src/main/java/org/apache/uima/ruta/seed/DefaultSeeder.java	(working copy)
@@ -40,7 +40,7 @@
   public static final String seedType = "org.apache.uima.ruta.type.TokenSeed";
 
   private final Pattern markupPattern = Pattern
-          .compile("\\s]+))?)+\\s*|\\s*)/?>");
+          .compile("\\s]+))?)+\\s*|\\s*)/?>");
 
   public Type seed(String text, CAS cas) {
     Type result = null;
Index: ruta-core/src/test/java/org/apache/uima/ruta/seed/DefaultSeederTest.java
===================================================================
--- ruta-core/src/test/java/org/apache/uima/ruta/seed/DefaultSeederTest.java	(revision 1709972)
+++ ruta-core/src/test/java/org/apache/uima/ruta/seed/DefaultSeederTest.java	(working copy)
@@ -110,7 +110,8 @@
     String document = ""
             + "" + ""
             + "" + ""
-            + "";
+            + "" + ""
+						+ "<-not-a-real-tag value=\"1\">" + "";
     String script = "RETAINTYPE(MARKUP);MARKUP{-> T1};";
     CAS cas = null;
     try {
@@ -120,10 +121,10 @@
       e.printStackTrace();
     }
 
-    RutaTestUtils.assertAnnotationsEquals(cas, 1, 6,
+    RutaTestUtils.assertAnnotationsEquals(cas, 1, 8,
             "", "",
             "", "", "",
-            "");
+            "", "", "");
 
     cas.release();
   }

--Apple-Mail=_1CBDD299-FE2D-4184-A179-2AB34BB838C6
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html;
	charset=utf-8

>Thanks Peter,
>
>The quotes are just normal quotes in the original source but the mail software must have changed
>this. Sorry about that misunderstanding.
>
>Cheers
>Mario
>
>> On 21/10/2015, at 16.03, Peter Klügl <peter.kluegl@averbis.com> wrote:
>>
>> Hi,
>>
>> I extended the pattern to support dashes, but not the other quotes. This
>> can get arbitrarily complex (and slow) if any combination of Unicode
>> characters that look like quotes should be supported. I still think that
>> this is not valid XML. Can you give me a link to the standard?
>>
>> It's maybe better to solve this in a specific use case before applying
>> the seeder.
>>
>> Best,
>>
>> Peter
>>
>>> On 20.10.2015 at 19:22, Mario Gazzo wrote:
>>> I believe it should be extended since I think that a RUTA user would expect that
>>> the MARKUP annotation indeed captures at least XML and HTML markup properly. The examples
>>> are from a PubMed Central XML file that follows the NISO JATS specification so I will assume
>>> it is properly formatted XML without knowing all the details of the spec.
>>>
>>> We have managed to implement a crude workaround for now but let us know when an improved
>>> version becomes available.
>>>
>>> Cheers
>>> Mario
>>>
>>>> On 20 Oct 2015, at 17:56, Peter Klügl <peter.kluegl@averbis.com> wrote:
>>>>
>>>> Hi Mario,
>>>>
>>>> yes, and the different quote also causes problems (are these valid?).
>>>>
>>>> The MARKUP annotation is not created by JFlex like the other annotations,
>>>> but by a postprocessing step using a regular expression. This expression
>>>> does not cover these cases (markupPattern in DefaultSeeder.java).
>>>>
>>>> Should we extend it?
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>>> On 20.10.2015 at 17:26, Mario Gazzo wrote:
>>>>> Hi Peter,
>>>>>
>>>>> RUTA doesn’t seem to capture some XML markup with attributes. Here are
>>>>> some examples:
>>>>>
>>>>> <xref ref-type="bibr" rid="b35-ehp0113-000220”>
>>>>> <sec sec-type="methods”>
>>>>>
>>>>> The above markup examples are totally missing in the TokenSeed annotations.
>>>>> I wonder whether it is related to the dash in the attribute names since other markup without
>>>>> this appears to be captured.
>>>>>
>>>>> Can you confirm that the dash could cause the problem?
>>>>>
>>>>> Cheers
>>>>> Mario

--Apple-Mail=_1CBDD299-FE2D-4184-A179-2AB34BB838C6--
--Apple-Mail=_8D39D8D4-D3FE-4B31-9C78-A64D7A574C29--