From user-return-8243-archive-asf-public=cust-asf.ponee.io@uima.apache.org  Wed Oct  9 20:19:25 2019
Return-Path: <user-return-8243-archive-asf-public=cust-asf.ponee.io@uima.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [207.244.88.153])
	by mx-eu-01.ponee.io (Postfix) with SMTP id E666B180645
	for <archive-asf-public@cust-asf.ponee.io>; Wed,  9 Oct 2019 22:19:24 +0200 (CEST)
Received: (qmail 63212 invoked by uid 500); 9 Oct 2019 20:19:24 -0000
Mailing-List: contact user-help@uima.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:user-help@uima.apache.org>
List-Unsubscribe: <mailto:user-unsubscribe@uima.apache.org>
List-Post: <mailto:user@uima.apache.org>
List-Id: <user.uima.apache.org>
Reply-To: user@uima.apache.org
Delivered-To: mailing list user@uima.apache.org
Received: (qmail 63200 invoked by uid 99); 9 Oct 2019 20:19:23 -0000
Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142)
    by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 09 Oct 2019 20:19:23 +0000
Received: from localhost (localhost [127.0.0.1])
	by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 1A77DC2147
	for <user@uima.apache.org>; Wed,  9 Oct 2019 20:19:23 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org
X-Spam-Flag: NO
X-Spam-Score: 2.249
X-Spam-Level: **
X-Spam-Status: No, score=2.249 tagged_above=-999 required=6.31
	tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1,
	HEADER_FROM_DIFFERENT_DOMAINS=0.249, HTML_MESSAGE=2,
	RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001,
	SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001]
	autolearn=disabled
Authentication-Results: spamd4-us-west.apache.org (amavisd-new);
	dkim=pass (2048-bit key) header.d=unsilo-ai.20150623.gappssmtp.com
Received: from mx1-ec2-va.apache.org ([10.40.0.8])
	by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024)
	with ESMTP id NRQB1Hq9A5eq for <user@uima.apache.org>;
	Wed,  9 Oct 2019 20:19:21 +0000 (UTC)
Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=209.85.160.172; helo=mail-qt1-f172.google.com; envelope-from=mj@unsilo.com; receiver=<UNKNOWN> 
Received: from mail-qt1-f172.google.com (mail-qt1-f172.google.com [209.85.160.172])
	by mx1-ec2-va.apache.org (ASF Mail Server at mx1-ec2-va.apache.org) with ESMTPS id E4300BCA1F
	for <user@uima.apache.org>; Wed,  9 Oct 2019 20:19:20 +0000 (UTC)
Received: by mail-qt1-f172.google.com with SMTP id l51so5041685qtc.4
        for <user@uima.apache.org>; Wed, 09 Oct 2019 13:19:20 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=unsilo-ai.20150623.gappssmtp.com; s=20150623;
        h=from:mime-version:subject:date:references:to:in-reply-to:message-id;
        bh=yLqm9xFzqpl7FAtOnLWuT6v+ADtrYtUMtFoW0YMF1Kc=;
        b=fRtLpZzs0hV90BmyNfkGmKezCfzeitc8nHQ3oIP4h3yibC8kPVnMkr9RsEIjHkN0QN
         5bHLwqJbRGEwnbtCE9WfXnHJMtZrFWvoYXp0RpXfGbks7iLuwPwlGM6zFeTw2VZJX9mS
         HBySgCO6mx/3/M+s+eNmd9fs0m6wrVRXDbdwKV0hW55OybqMIzBLnME8geYVUGTaNdMm
         kliJYX1IcdBxBHzhZlbaSoLdSLrkddP7f/xeCCRwK203UTn99JlrN9sioSe0O9BiZokB
         dy+B6hu2LhdANaapoHg5iVzN0t34PF6TYcJT7Mn8kQUTPQEPEEtZVqO6HS2ZQp6v/wjh
         fm8Q==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:from:mime-version:subject:date:references:to
         :in-reply-to:message-id;
        bh=yLqm9xFzqpl7FAtOnLWuT6v+ADtrYtUMtFoW0YMF1Kc=;
        b=KSg7m4ZKOOfFzX4V2kx8/xkrsHxMNEN6rD3syXP4NOPN589ofZ2CZegi7T1yhYMzom
         CS5JGVbQZZS9g6goljSmfQ2oAqTj1sXwnoCsURZPWFm1nmttxBUMdkNUf1w5YdjC1NLv
         L3r19jxd0gtL2Il9SCfWsJo4a06P/B8WBiOdiWKZPbB3mqnGEOtIV/SjEOrtbrjKSg+7
         jMHmaoW+uHOYQG873D91IEwDZfQie+sKfQGvdWDrwuWZ2lh2o7dvgWlZFSlxHTHLmLI+
         DEBWDT5X9RHRYNTaKF+/rv6hEtdHbEhUUHNOTR9IRpAefUoNGOY2iysGot9EgyPbDtTN
         WDEQ==
X-Gm-Message-State: APjAAAVWwVSopvuz0YnwmvcebjWFDh4ClMLTLZk7M6DJiq8DNDBK+I+w
	EUG/UebH2tYoztdoqfaLxRPTgm72wpV8OQ==
X-Google-Smtp-Source: APXvYqwP8WYu45SZDsI2zlHHaY4C1rQ/JCtwLYl1XFs1xVQJPP1g13VRaYMSK8v4+p0jACRyWjt52w==
X-Received: by 2002:ac8:b04:: with SMTP id e4mr5972056qti.272.1570652359813;
        Wed, 09 Oct 2019 13:19:19 -0700 (PDT)
Received: from ip-192-168-22-6.ec2.internal ([85.191.80.161])
        by smtp.gmail.com with ESMTPSA id p53sm1594553qtk.23.2019.10.09.13.19.17
        for <user@uima.apache.org>
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Wed, 09 Oct 2019 13:19:19 -0700 (PDT)
From: Mario Juric <mj@unsilo.ai>
Content-Type: multipart/alternative;
	boundary="Apple-Mail=_01A32305-8F5E-4841-8928-E9DEAB57DADA"
Mime-Version: 1.0 (Mac OS X Mail 11.5 \(3445.9.1\))
Subject: Re: Question about covering annotations in Ruta match semantics
Date: Wed, 9 Oct 2019 22:19:15 +0200
References: <DDF580E9-12A7-44A0-A5E0-E7EA803436A7@unsilo.ai>
 <8409a142-deb5-e1dd-1221-7fde5e95a597@averbis.com>
To: user@uima.apache.org
In-Reply-To: <8409a142-deb5-e1dd-1221-7fde5e95a597@averbis.com>
Message-Id: <8C673F8C-23E1-4480-95FD-F5BC7712438F@unsilo.ai>
X-Mailer: Apple Mail (2.3445.9.1)

--Apple-Mail=_01A32305-8F5E-4841-8928-E9DEAB57DADA
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=utf-8

Hi Peter,

Thanks a lot for the answer.

I am still trying to wrap my head around this. I understand that a
generic rule engine has to deal with issues I do not see, since I am
looking at an isolated case only. My thinking was that, in my particular
case, the covering annotation starts before the 'Dog Cat' match, so why
would its ending right before Cat prevent the rule from firing? It does
not follow Dog, so a rule like "Dog Covering {->MARK(CHASE)}" would not
match either. I understand now, though, that it is enough for something
else to be present in the area between the two rule elements for the
match to fail. However, as you describe, if SPACE annotations were
present, a rule like Dog SPACE Cat { -> MARK(CHASE)} would match despite
the covering annotation.
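
To make the behaviour discussed above concrete, here is a toy model in
Python. It is only a sketch built from the observations in this thread,
not Ruta's actual implementation. It assumes that the text is split at
every annotation begin and end, that a segment counts as empty (and thus
invisible) when no annotation begins at its start and none ends at its
end, and that two rule elements match in sequence only if everything
between them is invisible. The names gap_is_invisible and match_sequence
are made up for this illustration.

```python
from typing import List, Tuple

Annotation = Tuple[str, int, int]  # (type name, begin, end), end exclusive
Span = Tuple[int, int]


def gap_is_invisible(annos: List[Annotation], start: int, stop: int) -> bool:
    """Toy rule: the gap [start, stop) is invisible if all its segments are empty."""
    begins = {b for _, b, _ in annos}
    ends = {e for _, _, e in annos}
    # Segment borders: the gap limits plus every annotation offset inside the gap.
    cuts = sorted({start, stop} | {p for p in begins | ends if start < p < stop})
    for seg_begin, seg_end in zip(cuts, cuts[1:]):
        # Assumption: a segment is non-empty if an annotation begins at its
        # start or ends at its end.
        if seg_begin in begins or seg_end in ends:
            return False
    return True


def match_sequence(annos: List[Annotation], types: List[str]) -> List[Span]:
    """Spans where the given types follow each other with only invisible gaps."""
    partial = [(b, e) for t, b, e in annos if t == types[0]]
    for wanted in types[1:]:
        partial = [
            (b1, e2)
            for b1, e1 in partial
            for t2, b2, e2 in annos
            if t2 == wanted and e1 <= b2 and gap_is_invisible(annos, e1, b2)
        ]
    return sorted(partial)


# The annotations from the example text 'cat dog cat'.
base = [("Cat", 0, 3), ("Dog", 4, 7), ("Cat", 8, 11)]
wide = base + [("Covering", 0, 8)]
narrow = base + [("Covering", 0, 7)]
spaced = wide + [("SPACE", 7, 8)]

print(match_sequence(wide, ["Cat", "Dog"]))             # [(0, 7)]
print(match_sequence(wide, ["Dog", "Cat"]))             # []
print(match_sequence(narrow, ["Dog", "Cat"]))           # [(4, 11)]
print(match_sequence(spaced, ["Dog", "SPACE", "Cat"]))  # [(4, 11)]
```

Under these assumptions, Covering[0,8[ blocks only 'Dog Cat', because it
ends at offset 8 and so makes the segment [7,8[ non-empty. Shrinking it
to [0,7[ or matching an explicit SPACE at [7,8[ lets the rule fire.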

Have you ever described the matching implementation in a paper or
similar? I would be interested in having a look at it, or is it better
just to dive into the code? I would certainly prefer to read a
high-level, abstract specification first though :)

Generally, I cannot just trim the annotations in the real application,
since some of this whitespace is included in the annotated spans for
various reasons. I therefore played around with type filtering, hoping
that the type filter would let the rules match while ignoring any
filtered types. I was again surprised to find that filtering the
Covering type while retaining Cat and Dog just prevents anything from
being matched in this case. Filtering seems to make all text parts
invisible where the filtered types appear, no matter whether they cover
any retained annotation types. So this did not seem to solve my problem
either, although I could of course mark the areas I would otherwise
consider trimming and either include them in the rules like a space or
filter on them, which I guess is what you suggested. It suddenly becomes
somewhat awkward though, and it may be clearer to use RutaBasic in the
rules instead.
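
The effect described above can be pictured with a small Python sketch.
It is a toy model under the assumption that filtering a type hides all
text covered by annotations of that type, together with every other
annotation overlapping that text. It is not taken from the Ruta source,
and visible_annotations is a made-up name.

```python
from typing import List, Set, Tuple

Annotation = Tuple[str, int, int]  # (type name, begin, end), end exclusive


def visible_annotations(
    annos: List[Annotation], filtered: Set[str]
) -> List[Annotation]:
    """Toy rule: text under a filtered type is invisible, with all that lies on it."""
    hidden = [(b, e) for t, b, e in annos if t in filtered]
    return [
        (t, b, e)
        for t, b, e in annos
        if t not in filtered
        # Assumption: an annotation overlapping a hidden span is hidden too.
        and not any(b < h_end and h_begin < e for h_begin, h_end in hidden)
    ]


annos = [("Cat", 0, 3), ("Dog", 4, 7), ("Cat", 8, 11), ("Covering", 0, 8)]
print(visible_annotations(annos, {"Covering"}))  # [('Cat', 8, 11)]
```

In this model only the last Cat stays visible once Covering is filtered,
so neither 'Cat Dog' nor 'Dog Cat' has anything left to match.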


Cheers,
Mario


> On 9 Oct 2019, at 09:35, Peter Klügl <peter.kluegl@averbis.com> wrote:
>
> Hi Mario,
>
> I need to take a closer look as this is not the usual scenario :-)
>
> However, without testing, I would assume that the second rule does not
> match because the space between dog and cat is not "empty".
>
> Normally, you have a complete partitioning provided by the seeding,
> which creates the RutaBasic annotations. If there are only a few
> annotations, then there needs to be a decision whether a text position
> is visible or not (as you have no SPACE, BREAK and MARKUP annotations).
> You would expect that the space between the annotations is ignored, but
> there is actually no reason why Ruta should do that, as there is no
> information at all that it should be ignored (... generic system, you
> might want to write rules for whitespaces...). In order to avoid this
> problem in such situations, there is the option to define empty
> RutaBasics as invisible. These are text positions where no annotation
> begins or ends (and which are not covered by annotations), AFAIR, and
> where sequential matching could not match at all anyway. Thus, the
> first space is ignored, but not the second, because the Covering
> annotation ends there.
>
> Does that make sense?
>
> I think there are many options for making your rules more robust, but
> that depends on your complete system/pipeline. Is it an option to trim
> annotations in order to avoid whitespace at the beginning or end? Is it
> easy to identify these positions? You could create an annotation there
> and filter its type.
>
> Best,
>
> Peter
>
> On 07.10.2019 at 10:21, Mario Juric wrote:
>> Hi Peter,
>>
>> I have a script that is executed without any seeders for performance
>> reasons, and we don't need the seeded annotations in that case. I have
>> an issue involving annotations that partially cover the rule elements
>> of interest, and I do not have a simple solution for it, so I have a
>> question about the match semantics. Let me explain it using a simple
>> example and the text 'cat dog cat'.
>>
>> Assume the following 4 annotation types and 2 rule statements:
>>
>> DECLARE Covering;
>> DECLARE Cat;
>> DECLARE Dog;
>> DECLARE CHASE;
>> Cat Dog { -> MARK(CHASE)};
>> Dog Cat { -> MARK(CHASE)};
>>
>> Assume the following annotations, with their begin and end offsets,
>> exist prior to script execution:
>>
>> Cat[0,3[
>> Dog[4,7[
>> Cat[8,11[
>> Covering[0,8[
>>
>> The Covering annotation is an example of the disturbing element that I
>> observed, which has little or nothing to do with what I am trying to
>> match. It just happens to be there for a reason unrelated to these
>> rules, but it causes the second rule not to match when I expected it
>> to. Only the first rule fires, although the second also fires when I
>> change the Covering bounds to [0,7[.
>>
>> The order in which elements are matched seems very different from how
>> they are usually selected from the CAS index, where you would get
>> 'Covering Cat Dog Cat', and with this order you would intuitively
>> expect both rules to match. This is probably overly simplified
>> though, since I would not be able to match adjacent covering
>> annotations this way, so I believe matching is somehow based on edge
>> detection. Still, I have difficulty understanding why that extra
>> covered space makes a difference.
>>
>> I was hoping you could provide me with some details, and I would also
>> like to know what workaround options I have. I was considering playing
>> around with type filtering, but it would require a bit of adding and
>> removing of filtered types during the script, so it didn't seem like
>> the simplest solution. Ensuring that the covering annotation always
>> aligns with the end of a token is another possibility in this
>> particular case, but I still need to make the Ruta script generally
>> robust against these scenarios. Any feedback is much appreciated,
>> thanks :)
>>
>> Cheers,
>> Mario
>>
> --
> Dr. Peter Klügl
> R&D Text Mining/Machine Learning
>
> Averbis GmbH
> Salzstr. 15
> 79098 Freiburg
> Germany
>
> Fon: +49 761 708 394 0
> Fax: +49 761 708 394 10
> Email: peter.kluegl@averbis.com
> Web: https://averbis.com
>
> Headquarters: Freiburg im Breisgau
> Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
> Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
>


--Apple-Mail=_01A32305-8F5E-4841-8928-E9DEAB57DADA--