Subject: Re: Question about covering annotations in Ruta match semantics
To: user@uima.apache.org
From: Peter Klügl
Message-ID: <8409a142-deb5-e1dd-1221-7fde5e95a597@averbis.com>
Date: Wed, 9 Oct 2019 09:35:17 +0200

Hi Mario,

I need to take a closer look as this is not the usual scenario :-)

However, without testing, I would assume that the second rule does not match because the space between dog and cat is not "empty".

Normally, you have a complete partitioning provided by the seeding, which creates the RutaBasic annotations. If there are only a few annotations, then a decision has to be made whether a text position is visible or not (as you have no SPACE, BREAK and MARKUP annotations). You would expect the space between the annotations to be ignored, but there is actually no reason why Ruta should do that, as there is no information at all that it should be ignored (... generic system, you might want to write rules for whitespaces...).

In order to avoid this problem in such situations, there is the option to define empty RutaBasics as invisible. Those are text positions where no annotation begins or ends (and which are not covered by annotations), AFAIR, and where sequential matching could not match at all anyway.

Thus, the first space is ignored, but not the second, because the Covering annotation ends there.

Does that make sense?

I think there are many options for making your rules more robust, but that depends on your complete system/pipeline. Is it an option to trim annotations in order to avoid whitespaces at the beginning or ending?
Is it easy to identify these positions? You could create an annotation there and filter its type.

Best,

Peter

On 07.10.2019 at 10:21, Mario Juric wrote:
> Hi Peter,
>
> I have a script that is executed without any seeders for performance reasons, and we don't need the seeded annotations in that case. I have an issue involving annotation elements that partially cover the rule elements of interest, and I do not have a simple solution for it, so I have a question about the match semantics. Let me explain it using a simple example and the text 'cat dog cat'.
>
> Assume the following 4 annotation types and 2 rule statements:
>
> DECLARE Covering;
> DECLARE Cat;
> DECLARE Dog;
> DECLARE CHASE;
> Cat Dog { -> MARK(CHASE)};
> Dog Cat { -> MARK(CHASE)};
>
> Assume prior to script execution the following annotations with beginnings and endings:
>
> Cat[0,3[
> Dog[4,7[
> Cat[8,11[
> Covering[0,8[
>
> The Covering annotation is an example of the disturbing element that I observed, which has little or nothing to do with what I am trying to match. It just happens to be there for a reason unrelated to these rules, but it causes the second rule not to match when I expected it to. Only the first rule fires, but the second also fires when I change the Covering bounds to [0,7[.
>
> The order in which elements are matched seems very different from how they are usually selected from the CAS index, where you would get 'Covering Cat Dog Cat', and with this order you would intuitively expect both rules to match. This would probably be overly simplified though, since I would not be able to match adjacent covering annotations this way, so I believe matching is somehow based on edge detection. Still, I have difficulty understanding why that extra covered space makes a difference.
>
> I was hoping you could provide me with some details, and I would also like to know what possible workaround options I have. I was considering playing around with type filtering, but it would require a bit of adding/removing types to be filtered during the script, so it didn't seem like the simplest solution. Ensuring that the covering annotation always aligns with the end of a token is another possibility in this particular case, but I still need to add general robustness to the Ruta script against these scenarios. Any feedback is much appreciated, thanks :)
>
> Cheers,
> Mario

-- 
Dr. Peter Klügl
R&D Text Mining/Machine Learning
Averbis GmbH
Salzstr. 15
79098 Freiburg
Germany
Phone: +49 761 708 394 0
Fax: +49 761 708 394 10
Email: peter.kluegl@averbis.com
Web: https://averbis.com
Headquarters: Freiburg im Breisgau
Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
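
The "empty position" explanation in the reply can be illustrated with a small toy model. This is only a sketch in Python, not Ruta's implementation: the names Ann, segments, is_empty, adjacent, match_sequence and trim are all hypothetical, and the rule "a span is empty if no annotation begins at its start or ends at its end" is an assumption taken from the reply and from the behaviour reported in the question. Coverage is deliberately ignored, since the first rule fires even though Covering spans the first space. Under these assumptions the model reproduces the reported matches, and trimming the trailing whitespace from Covering lets the second rule match as well.

```python
from typing import List, NamedTuple, Tuple


class Ann(NamedTuple):
    name: str
    begin: int
    end: int  # exclusive, as in Cat[0,3[


def segments(anns: List[Ann]) -> List[Tuple[int, int]]:
    """Split the text into minimal spans at every annotation boundary."""
    bounds = sorted({a.begin for a in anns} | {a.end for a in anns})
    return list(zip(bounds, bounds[1:]))


def is_empty(seg: Tuple[int, int], anns: List[Ann]) -> bool:
    """Assumed rule: empty if nothing begins at the span start or ends at the span end."""
    b, e = seg
    return not any(a.begin == b or a.end == e for a in anns)


def adjacent(left: Ann, right: Ann, anns: List[Ann]) -> bool:
    """True if every span between left and right is empty, i.e. treated as invisible."""
    between = [s for s in segments(anns) if left.end <= s[0] and s[1] <= right.begin]
    return all(is_empty(s, anns) for s in between)


def match_sequence(first: str, second: str, anns: List[Ann]) -> List[Tuple[int, int]]:
    """Offsets of every match of the two-element rule `first second` in this toy model."""
    return [(l.begin, r.end)
            for l in anns if l.name == first
            for r in anns if r.name == second
            and l.end <= r.begin and adjacent(l, r, anns)]


def trim(ann: Ann, text: str) -> Ann:
    """Shrink an annotation so that it neither starts nor ends on whitespace."""
    b, e = ann.begin, ann.end
    while b < e and text[b].isspace():
        b += 1
    while e > b and text[e - 1].isspace():
        e -= 1
    return ann._replace(begin=b, end=e)


text = "cat dog cat"
anns = [Ann("Cat", 0, 3), Ann("Dog", 4, 7), Ann("Cat", 8, 11), Ann("Covering", 0, 8)]

print(match_sequence("Cat", "Dog", anns))     # [(0, 7)]  first rule fires
print(match_sequence("Dog", "Cat", anns))     # []        Covering ends at 8, so [7,8[ is not empty

trimmed = [trim(a, text) for a in anns]       # Covering becomes [0,7[
print(match_sequence("Dog", "Cat", trimmed))  # [(4, 11)] second rule fires after trimming
```

In the model, the span [3,4[ has no annotation boundary on either side and is skipped, while [7,8[ carries the end of Covering and therefore blocks the sequence. Whether the real engine applies exactly this test is not confirmed here.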