From: Mario Juric
To: user@uima.apache.org
Subject: Re: Question about covering annotations in Ruta match semantics
Date: Wed, 9 Oct 2019 22:19:15 +0200

Hi Peter,

Thanks a lot for the answer.

I am still trying to wrap my head around this, and I understand the
issues at play when dealing with a generic rule engine, since I am
looking at an isolated case only. I was just thinking that in my
particular case the covering annotation starts before the matching
'Dog Cat', so why would its ending right before Cat prevent the rule
from firing? It doesn't follow Dog, and a rule like
"Dog Covering {->MARK(CHASE)}" would therefore not match either, but I
understand now that anything else being present in the area between the
two rule elements is enough for the match to fail. However, as you
describe, with SPACE annotations present, a rule like
Dog SPACE Cat { -> MARK(CHASE)} would succeed in matching despite the
presence of the covering annotation.

Have you ever described the implementation of the matching in some paper
or similar? I would be interested to have a look at it, but maybe it's
better just to have a go at the code?
I would certainly prefer reading a high-level abstract specification
first, though :)

Generally I cannot just trim the annotations in the real application,
since some of these whitespaces are included in the marking for various
reasons. I therefore played around with type filtering, hoping that the
type filter would allow me to match the rules while ignoring any
presence of filtered types. I was again surprised to find that filtering
the Covering type while retaining Cat and Dog would in this case just
prevent anything from being matched, because it seems to make all text
parts invisible where the filtered types appear, no matter whether they
cover any retained annotation types. So this didn't seem to solve my
problem either, although I could of course try to mark those areas I
would otherwise consider trimming and include them in the rules like a
space, or filter on them, which I guess is what you suggested. It just
becomes somewhat awkward, though, and it may be clearer to use RutaBasic
in the rules instead.

Cheers,
Mario

> On 9 Oct 2019, at 09:35, Peter Klügl wrote:
>
> Hi Mario,
>
> I need to take a closer look as this is not the usual scenario :-)
>
> However, without testing, I would assume that the second rule does not
> match because the space between dog and cat is not "empty".
>
> Normally, you have a complete partitioning provided by the seeding,
> which causes the RutaBasic annotations. If there are only a few
> annotations, then there needs to be a decision whether a text position
> is visible or not (as you have no SPACE, BREAK and MARKUP annotations).
> You would expect that the space between the annotations is ignored, but
> there is actually no reason why Ruta should do that, as there is no
> information at all that it should be ignored (... generic system, you
> might want to write rules for whitespaces...).
> In order to avoid this problem in such situations, there is the option
> to define empty RutaBasics as invisible. Those are text positions where
> no annotation begins or ends (and not covered by annotations) AFAIR,
> and sequential matching could not match at all anyway. Thus, the first
> space is ignored, but not the second, because the Covering annotation
> ends there.
>
> Does that make sense?
>
> I think there are many options for making your rules more robust, but
> that depends on your complete system/pipeline. Is it an option to trim
> annotations in order to avoid whitespaces at the beginning or ending?
> Is it easy to identify these positions? You could create an annotation
> there and filter that type.
>
> Best,
>
> Peter
>
> On 07.10.2019 at 10:21, Mario Juric wrote:
>> Hi Peter,
>>
>> I have a script that is executed without any seeders for performance
>> reasons, and we don't need the seeded annotations in that case. I have
>> an issue involving annotation elements that partially cover the rule
>> elements of interest, and I do not have a simple solution for it, so I
>> have a question about the match semantics. Let me explain it using a
>> simple example and the text 'cat dog cat'.
>>
>> Assume the following 4 annotation types and 2 rule statements:
>>
>> DECLARE Covering;
>> DECLARE Cat;
>> DECLARE Dog;
>> DECLARE CHASE;
>> Cat Dog { -> MARK(CHASE)};
>> Dog Cat { -> MARK(CHASE)};
>>
>> Assume prior to script execution the following annotations with
>> beginnings and endings:
>>
>> Cat[0,3[
>> Dog[4,7[
>> Cat[8,11[
>> Covering[0,8[
>>
>> The Covering annotation is an example of the disturbing element that I
>> observed, which has little or nothing to do with what I am trying to
>> match. It just happens to be there for a reason unrelated to these
>> rules, but it causes the second rule not to match when I expected it to.
>> Only the first rule fires, but the second also fires when I change
>> the Covering bounds to [0,7[.
>>
>> The order in which elements are matched seems very different from how
>> they are usually selected from the CAS index, where you would get
>> 'Covering Cat Dog Cat', and with this order you would intuitively
>> expect both rules to match. This would probably be overly simplified
>> though, since I would not be able to match adjacent covering
>> annotations this way, so I believe matching is somehow based on edge
>> detection. Still, I have difficulty understanding why that extra
>> covered space makes a difference.
>>
>> I was hoping you could provide me with some details, and I would also
>> like to know what possible workaround options I have. I was considering
>> playing around with type filtering, but it would require a bit of
>> adding/removing types to be filtered during the script, so it didn't
>> seem like the simplest solution. Ensuring that Covering always aligns
>> with the end of a token is another possibility in this particular case,
>> but I still need to add general robustness to the Ruta script against
>> these scenarios. Any feedback is much appreciated, thanks :)
>>
>> Cheers,
>> Mario
>
> --
> Dr. Peter Klügl
> R&D Text Mining/Machine Learning
>
> Averbis GmbH
> Salzstr. 15
> 79098 Freiburg
> Germany
>
> Fon: +49 761 708 394 0
> Fax: +49 761 708 394 10
> Email: peter.kluegl@averbis.com
> Web: https://averbis.com
>
> Headquarters: Freiburg im Breisgau
> Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
> Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
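
The behaviour reported in this thread can be reproduced with a small toy
model. The Python sketch below is not Ruta's implementation. It assumes,
following the "AFAIR" description in the quoted reply, that the text is
split at every annotation boundary and that a segment is invisible only
when no annotation begins at its start or ends at its end. Coverage alone
is assumed not to make a segment visible, since the first space is covered
by Covering and the first rule still fires. The names `segments`,
`is_visible` and `sequential_matches` are hypothetical helpers invented
for illustration.

```python
from typing import List, Tuple

Annotation = Tuple[str, int, int]  # (type name, begin, end); end is exclusive
Segment = Tuple[int, int]


def segments(annotations: List[Annotation], text_len: int) -> List[Segment]:
    """Split [0, text_len) at every annotation begin and end offset."""
    cuts = {0, text_len}
    for _, begin, end in annotations:
        cuts.update((begin, end))
    ordered = sorted(cuts)
    return list(zip(ordered, ordered[1:]))


def is_visible(segment: Segment, annotations: List[Annotation]) -> bool:
    """Assumed rule: visible if an annotation begins at the segment's
    start or ends at the segment's end; otherwise the segment is 'empty'."""
    begin, end = segment
    return any(b == begin or e == end for _, b, e in annotations)


def sequential_matches(first: str, second: str,
                       annotations: List[Annotation],
                       text_len: int) -> List[Segment]:
    """Find `first second` pairs separated only by invisible segments.
    Returns the span from the first element's begin to the second's end."""
    segs = segments(annotations, text_len)
    hits = []
    for type1, b1, e1 in annotations:
        if type1 != first:
            continue
        for type2, b2, e2 in annotations:
            if type2 != second or b2 < e1:
                continue
            # Segments lying between the two rule elements.
            gap = [s for s in segs if e1 <= s[0] and s[1] <= b2]
            if not any(is_visible(s, annotations) for s in gap):
                hits.append((b1, e2))
    return hits


TEXT = "cat dog cat"
BASE = [("Cat", 0, 3), ("Dog", 4, 7), ("Cat", 8, 11)]
WITH_COVERING_8 = BASE + [("Covering", 0, 8)]  # the case from the thread
WITH_COVERING_7 = BASE + [("Covering", 0, 7)]  # the variant that works

print(sequential_matches("Cat", "Dog", WITH_COVERING_8, len(TEXT)))  # [(0, 7)]
print(sequential_matches("Dog", "Cat", WITH_COVERING_8, len(TEXT)))  # []
print(sequential_matches("Dog", "Cat", WITH_COVERING_7, len(TEXT)))  # [(4, 11)]
```

With Covering[0,8[ the segment [7,8[ counts as visible because Covering
ends at offset 8, so 'Dog Cat' is blocked. With Covering[0,7[ no
annotation begins at 7 or ends at 8, the segment is empty, and both rules
match, which agrees with the observation in the original question.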