Subject: Re: Usage of anchors
To: user@uima.apache.org
From: Peter Klügl
Date: Fri, 30 Aug 2019 16:46:30 +0200

Hi,

On 29.08.2019 at 17:31, Nikolai Krot wrote:
> Hi Peter,
>
> Thank you for your answer.
> Is this the relevant issue:
> https://issues.apache.org/jira/browse/UIMA-3862 ?

Yes. (The description should really be more informative.)

>
> Honestly, your answer is a revelation for me :) I originally thought that
> matching on literals should be faster because no extra step of preliminary
> annotation thereof is required. Can I expect a speed-up if I implement the
> rules as follows:
>
> 1) all/most of the literals that are found in rules are first wrapped into an
> annotation, say, WRD;
>
> MARKTABLE(WRD, VocabularyOfWordsAppearingInRules);
>
> 2) the rules that rely on these literals are rewritten to be something like
> this:
>
> ... @WRD.ct == "hello" ... {-> ACTION1};
> ... @WRD.ct == "world" ... {-> ACTION2};
>
> I'm just curious. We are trying to figure out the best tactic for
> writing the rules to guarantee they run the fastest.

Yes, I would assume that it is faster. However, it depends on many
factors, e.g., the distribution of the words, the length of the document,
the length of the rule, and the index of the anchor.

I would recommend the usage of FOREACH in order to avoid redundant index
matches on the same annotation.

In my use cases, the initialization of the stream is often relatively
expensive since there are many Ruta components in a pipeline that each
reindex the RutaBasics anew. Thus, the speed of a rule is sometimes not
as important as the combination with other annotators.

Best,

Peter

> Best regards,
> Nikolai
>
>
> On Thu, Aug 29, 2019 at 3:26 PM Peter Klügl wrote:
>
>> Hi,
>>
>> On 29.08.2019 at 15:21, Nikolai Krot wrote:
>>> Hi Peter,
>>>
>>> thank you for your answer. Can you confirm my understanding (I have certain
>>> difficulty understanding stacked negations)
>>>
>>> * it may be a problem if a literal string in a rule is also an anchor
>>> (either explicitly set by user or selected by rule interpreter)
>>
>> yes, it is especially inefficient because there is no index on the
>> covered text.
>> The rule element needs to evaluate every RutaBasic in the
>> current window (document) by comparing the covered text to the string
>> value. It is of course much slower than the usual case, where you can
>> restrict the type of annotation somehow and use an annotation index.
>>
>>
>> Best,
>>
>>
>> Peter
>>
>>
>>> Best regards,
>>> Nikolai
>>>
>>> On Thu, Aug 29, 2019 at 2:27 PM Peter Klügl wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> the second option should be preferred at least until UIMA-3862 is
>>>> resolved with some additional indexing.
>>>>
>>>> It is of course not so problematic if the literal matching condition is
>>>> not the starting anchor. However, it is still annoying that the rule
>>>> elements need to be designed according to the dynamic partitioning of the
>>>> RutaBasics. This easily leads to problems in larger pipelines.
>>>>
>>>>
>>>> Best,
>>>>
>>>>
>>>> Peter
>>>>
>>>>
>>>> On 29.08.2019 at 11:59, Nikolai Krot wrote:
>>>>> Hi Peter,
>>>>>
>>>>> I have a question about this comment of yours:
>>>>>
>>>>> < ... but the matching using literal string expression is still really
>>>>> inefficient.
>>>>>
>>>>> What do you mean by "inefficient"? Do you mean it is slow? Say, if I want
>>>>> to use a literal in one hundred rules, what is a better strategy:
>>>>> 1) writing the string literally in every one of these 100 rules; or
>>>>> 2) annotating the string (using MARKTABLE) and then using the annotation
>>>>> in these 100 rules?
>>>>>
>>>>> Best regards,
>>>>> Nikolai
>>>>>
>>>>> On Mon, Aug 26, 2019 at 2:27 PM Peter Klügl wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> On 21.08.2019 at 15:47, Dominik Terweh wrote:
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> Thanks a lot for the clarification. I was wondering about (10) too.
>>>>>>>
>>>>>>> Following your explanation I was wondering: does it make sense to anchor
>>>>>>> sequences, such as in (8), and is it "legal" to use multiple anchors in
>>>>>>> hierarchical fashion?
>>>>>>> Like A @(B @C D)?
>>>>>> Yes, it is "legal", but you have to be careful. (There are not enough
>>>>>> unit tests for those rules.)
>>>>>>
>>>>>>
>>>>>>> Also, is there a difference between the processing of sequences of
>>>>>>> annotations or literals (given "A" is annotated as A and so on)?
>>>>>>> A @(B C D)
>>>>>>> vs.
>>>>>>> "A" @("B" "C" "D")
>>>>>>> vs.
>>>>>>> A @("B" C "D")
>>>>>> It should not make a difference for the result, but the matching using
>>>>>> literal string expressions is still really inefficient.
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>>
>>>>>>> Best
>>>>>>> Dominik
>>>>>>>
>>>>>>>
>>>>>>> Dominik Terweh
>>>>>>> Intern
>>>>>>>
>>>>>>> DROOMS
>>>>>>>
>>>>>>> Drooms GmbH
>>>>>>> Eschersheimer Landstraße 6
>>>>>>> 60322 Frankfurt, Germany
>>>>>>> www.drooms.com
>>>>>>>
>>>>>>> Phone:
>>>>>>> Fax:
>>>>>>> Mail: d.terweh@drooms.com
>>>>>>>
>>>>>>> Subscribe to the Drooms newsletter
>>>>>>> https://drooms.com/en/newsletter?utm_source=newslettersignup&utm_medium=emailsignature
>>>>>>> Drooms GmbH; Sitz der Gesellschaft / Registered Office: Eschersheimer
>>>>>>> Landstr. 6, D-60322 Frankfurt am Main; Geschaeftsfuehrung / Management
>>>>>>> Board: Alexandre Grellier;
>>>>>>> Registergericht / Court of Registration: Amtsgericht Frankfurt am Main,
>>>>>>> HRB 76454; Finanzamt / Tax Office: Finanzamt Frankfurt am Main,
>>>>>>> USt-IdNr.: DE 224007190
>>>>>>>
>>>>>>> On 21.08.19, 12:10, "Peter Klügl" wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 20.08.2019 at 16:09, Dominik Terweh wrote:
>>>>>>> >
>>>>>>> > Dear All,
>>>>>>> >
>>>>>>> > I have some questions regarding processing times and anchors ("@").
>>>>>>> >
>>>>>>> > First of all, is it possible to define an anchor on a disjunction?
>>>>>>> >
>>>>>>> > What I tested was to have a simple rule (1) that should start on the
>>>>>>> > element in the middle (2).
>>>>>>> > Now this element had a variation (3), but I
>>>>>>> > could not use the anchor in that case anymore:
>>>>>>> >
>>>>>>> > 1) A B C; // works
>>>>>>> >
>>>>>>> > 2) A @B C; // works
>>>>>>> >
>>>>>>> > 3) A @(B|D) C; // NOT WORKING
>>>>>>> >
>>>>>>> > Is this behaviour intended or simply not supported?
>>>>>>> >
>>>>>>> > [NOTE: NOT WORKING means eclipse does not complain, but the rule never
>>>>>>> > matches]
>>>>>>> >
>>>>>>> > The above led to some testing with a different setup (4); however,
>>>>>>> > since disjunctions don't seem to work, this was also not valid.
>>>>>>> >
>>>>>>> > 4) A @((B C) | (D C)); // NOT WORKING
>>>>>>> >
>>>>>>>
>>>>>>> Anchors at disjunct rule elements are syntactically supported but do not
>>>>>>> work correctly. I will open a bug ticket.
>>>>>>>
>>>>>>> >
>>>>>>> > Is there a scenario where anchors are valid in and before brackets?
>>>>>>> > From my observation I've seen that (5)-(10) are all working as
>>>>>>> > expected and all start matching on B. But do they differ in terms of
>>>>>>> > processing? I noticed slightly longer processing times in (5) and ever
>>>>>>> > so slightly in (6), but not very indicative. Could (5)-(10) differ in
>>>>>>> > processing time?
>>>>>>> >
>>>>>>> > 5) A @B C
>>>>>>> >
>>>>>>> > 6) (A @B C)
>>>>>>> >
>>>>>>> > 7) @(A @B C)
>>>>>>> >
>>>>>>> > 8) A @(B C)
>>>>>>> >
>>>>>>> > 9) A @(@B C)
>>>>>>> >
>>>>>>> > 10) A (@B C)
>>>>>>> >
>>>>>>>
>>>>>>> Yes, since different combinations of methods are called, but I think
>>>>>>> there should not be a big difference between (5)-(9).
>>>>>>>
>>>>>>> >
>>>>>>> > Since rule (10) works as expected, why does (11) work differently and
>>>>>>> > start on A but not on B and D?
>>>>>>> > (This would be useful in a scenario
>>>>>>> > where B and D combined appear less often than A)
>>>>>>> >
>>>>>>> > 11) A ((@B C) | (@D C)); // starts matching on A
>>>>>>> >
>>>>>>>
>>>>>>> I have to check that. I think (10) starts with A too.
>>>>>>>
>>>>>>>
>>>>>>> Two comments for anchors and disjunct rule elements:
>>>>>>>
>>>>>>> Anchors started as a manual option to optimize the rule execution time
>>>>>>> compared to the automatic dynamic anchoring. However, the anchor can
>>>>>>> considerably change the consequences of a rule. For me, the anchor is
>>>>>>> more of an engineering option which also can be used to speed up the
>>>>>>> rules.
>>>>>>>
>>>>>>> Disjunct rule elements are not well supported and maintained in Ruta.
>>>>>>> Their implementation is not efficient and they can lead to unintended
>>>>>>> matches. Thus, their usage is not allowed in my team and I would not
>>>>>>> recommend using them right now.
>>>>>>>
>>>>>>> (I will try to find the time to improve the implementation)
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>>
>>>>>>> > Thank you in advance for your answers,
>>>>>>> >
>>>>>>> > Best
>>>>>>> >
>>>>>>> > Dominik
>>>>>>> >
>>>>>>> --
>>>>>>> Dr. Peter Klügl
>>>>>>> R&D Text Mining/Machine Learning
>>>>>>>
>>>>>>> Averbis GmbH
>>>>>>> Salzstr. 15
>>>>>>> 79098 Freiburg
>>>>>>> Germany
>>>>>>>
>>>>>>> Phone: +49 761 708 394 0
>>>>>>> Fax: +49 761 708 394 10
>>>>>>> Email: peter.kluegl@averbis.com
>>>>>>> Web: https://averbis.com
>>>>>>>
>>>>>>> Headquarters: Freiburg im Breisgau
>>>>>>> Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
>>>>>>> Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó

--
Dr. Peter Klügl
R&D Text Mining/Machine Learning

Averbis GmbH
Salzstr. 15
79098 Freiburg
Germany

Phone: +49 761 708 394 0
Fax: +49 761 708 394 10
Email: peter.kluegl@averbis.com
Web: https://averbis.com

Headquarters: Freiburg im Breisgau
Register Court: Amtsgericht Freiburg im Breisgau, HRB 701080
Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
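
Addendum (illustration only): the thread's central point is that a literal string used as an anchor has to be compared against the covered text of every RutaBasic, whereas a typed anchor can go through an annotation index. The toy Python model below sketches that difference by counting how many candidates each strategy visits. It is not Ruta code and makes no claim about Ruta's actual implementation; `Token`, `tokenize`, `match_literal_scan`, `build_type_index` and `match_by_type` are hypothetical names invented for this sketch, and the sample sentence is made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Toy stand-in for a basic annotation: a span plus its covered text."""
    begin: int
    end: int
    text: str


def tokenize(document):
    """Split on whitespace and record character offsets (hypothetical helper)."""
    tokens, pos = [], 0
    for word in document.split():
        begin = document.index(word, pos)
        end = begin + len(word)
        tokens.append(Token(begin, end, word))
        pos = end
    return tokens


def match_literal_scan(tokens, literal):
    """Literal anchor: no index on covered text, so every token is compared."""
    comparisons = 0
    hits = []
    for tok in tokens:
        comparisons += 1
        if tok.text == literal:
            hits.append(tok)
    return hits, comparisons


def build_type_index(tokens, vocabulary):
    """One pre-annotation pass: tokens whose text is in the vocabulary
    are indexed under the (assumed) type name 'WRD'."""
    index = defaultdict(list)
    for tok in tokens:
        if tok.text in vocabulary:
            index["WRD"].append(tok)
    return index


def match_by_type(index, type_name, literal):
    """Typed anchor: only annotations of the given type are visited."""
    candidates = index.get(type_name, [])
    hits = [tok for tok in candidates if tok.text == literal]
    return hits, len(candidates)


document = "hello big wide world and hello again to the whole world"
tokens = tokenize(document)

# Strategy 1: the literal is the anchor, every token is inspected.
scan_hits, scan_cost = match_literal_scan(tokens, "hello")

# Strategy 2: annotate the vocabulary once, then anchor on the type.
index = build_type_index(tokens, {"hello", "world"})
typed_hits, typed_cost = match_by_type(index, "WRD", "hello")

print(f"literal scan visited {scan_cost} tokens, typed anchor visited {typed_cost}")
```

In this toy model both strategies find the same matches, but the typed lookup only visits the pre-annotated candidates. The indexing pass is paid once and can be shared by every rule that uses the type, which is the situation described in the question about using one literal in a hundred rules. How large the gain is in a real pipeline depends on the factors named in the reply (word distribution, document length, rule length, anchor position).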