Subject: Re: Very long Ruta stream initialization
From: Mario Gazzo
Date: Thu, 7 Jan 2016 10:22:26 +0100
To: user@uima.apache.org

Yes, where do we sign this? :-)

> On 07 Jan 2016, at 10:16, Peter Klügl wrote:
>
> :-) let me know if you need help or have any questions.
>
> On 07.01.2016 at 10:12, Mario Gazzo wrote:
>> Yes, let us just sign and submit it.
>>
>>> On 07 Jan 2016, at 10:11, Peter Klügl wrote:
>>>
>>> Hi,
>>>
>>> thanks, that would be great. Patches are simply attached to the issue.
>>> Non-trivial changes require an ICLA. Do you want to sign and submit it?
>>>
>>> Best,
>>>
>>> Peter
>>>
>>>
>>> On 07.01.2016 at 10:08, Mario Gazzo wrote:
>>>> Thanks,
>>>>
>>>> I just added the JIRA issue: https://issues.apache.org/jira/browse/UIMA-4729
>>>>
>>>> If you like, then we can also implement it and submit a patch, just let us know what the process is.
>>>>
>>>> Cheers
>>>> Mario
>>>>
>>>>> On 07 Jan 2016, at 09:08, Peter Klügl wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 06.01.2016 at 14:48, Mario Gazzo wrote:
>>>>>> Hi Peter,
>>>>>>
>>>>>> I had a look at the test cases and I think there are many interesting and useful features that cover many of our use cases but I will have to experiment with them before I know what might be missing. I have a few questions though:
>>>>>>
>>>>>> 1) It appears that we would then also be able to assign annotations to lists, which is nice. I am not sure from looking at the tests whether it is possible to use ADD with the annotation lists but I assume so.
>>>>> Not yet, but I will implement it. It's still work in progress. But
>>>>> thanks for pointing it out, I would probably have forgotten about it.
>>>>>
>>>>>> 2) The use of addresses is unclear to me just from reading the test, maybe you could explain them? This concept is very new to me.
>>>>> It's not intended to be utilized directly in a rule file. It's rather
>>>>> just a way to combine logic in java with ruta rules or use ruta
>>>>> functionality in java code.
>>>>> Let's say we have a new method like
>>>>> boolean Ruta.matches(CAS cas, String rule, AnnotationFS... annotations)
>>>>> and you call it with something like (syntax is not yet specified)
>>>>> Ruta.matches(cas, "${PARTOF(Headline)} Keyword;", annotation)
>>>>> Then, the "$" would be replaced by the address of the annotation and the
>>>>> method would return whether the annotation is covered by a Headline
>>>>> annotation and is followed by a Keyword annotation.
>>>>>
>>>>>> 3) The annotation feature expression looks nice but I wonder whether an array element can also be referenced using an int expression and not just a constant, e.g. Struct.as[intVar+1]{->T1};
>>>>> Yes, without allowing number expressions, it would not really be useful.
>>>>> The current implementation is just a test in order to check whether the
>>>>> internal object model is good enough to cover it. The complete
>>>>> functionality will probably not be included in the next release since
>>>>> there is still much work left in order to get it up and running. The
>>>>> semantics of such expressions (Struct.as) are resolved on the fly, and
>>>>> the code does not support expressions at all. I still have to think
>>>>> about a way to implement it.
>>>>>
>>>>>> The label expressions are also useful and will make some of our rules more readable.
>>>>>>
>>>>>> Finally I have one additional question about the MARKUP initialisation. I have a case where I need the token seeds coming from the default seeder but I don’t want to run the markup initialisation. Is there a separate seeder defined for this somewhere? Right now I have my own copy of the default seeder without the MARKUP initialisation but obviously I do not want to maintain this. It looks as if they could also be split into two seeders with both added as default and then I could overwrite with my own seeder list containing only the token seeder.
>>>>> Yes, we can split them or just add another one that ignores markup.
>>>>> I was also always thinking about adding a DetailedSeeder that creates much
>>>>> more fine-grained types like different brackets and quotes... but it was
>>>>> never on top of my todo list.
>>>>>
>>>>> Do you want to open a jira issue for it?
>>>>>
>>>>> Best,
>>>>>
>>>>> Peter
>>>>>
>>>>>> Cheers
>>>>>> Mario
>>>>>>
>>>>>>
>>>>>>> On 04 Jan 2016, at 17:06, Peter Klügl wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 04.01.2016 at 16:13, Mario Gazzo wrote:
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> No problem, I was anyway pretty much offline myself during Christmas holidays.
>>>>>>>>
>>>>>>>> The term “overhead” is probably an exaggeration in this context, especially after I disabled the MARKUP initialisation. We earlier implemented our own XML markup annotator tailored to better fit our needs with additional annotation types and properties, so the Ruta MARKUP is currently not used. It just happens that we don’t directly use RutaBasic in any of our rules in this particular case so I was curious to know whether we could avoid creating them in the first place since there seem to be quite a few. However, overall processing required by our Ruta scripts compared to other processing steps is now small and sub-optimising this further by making RutaBasic optional would currently be of very low priority to us. We would prioritise other features higher, e.g. being able to assign annotations to variables as we discussed previously in another thread.
>>>>>>> I am working on this right now and there is finally some first progress :-)
>>>>>>>
>>>>>>> I fear that I won't catch all use cases (combinations with language
>>>>>>> elements) with the first attempt.
>>>>>>> If you are interested (and wanna take
>>>>>>> care I do not miss your use case), feel free to take a look at the new
>>>>>>> unit tests:
>>>>>>> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation
>>>>>>>
>>>>>>> It's still work in progress. Proposals for more unit tests are very welcome.
>>>>>>>
>>>>>>>> We haven’t processed documents as large as those you mention since books have so far been divided into chapters and processing could therefore be parallelised accordingly. We also drop extreme outliers above a certain size if we encounter them and then we batch process them later in smaller chunks but this has so far not been necessary with our current data sets. Like you, our processing bottlenecks are now in different components.
>>>>>>> Ah, that's nice to hear that ruta is not the bottleneck :-D
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Mario
>>>>>>>>
>>>>>>>>> On 30 Dec 2015, at 16:44, Peter Klügl wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> sorry for the delayed reply.
>>>>>>>>>
>>>>>>>>> RutaEngine::initializeStream:
>>>>>>>>>
>>>>>>>>> The special treatment of MARKUPs that causes the increased time required for initialization is just a workaround because I was too lazy to write a working jflex rule. Well, I tried but failed. It shouldn't be hard to improve this code... I will create an issue for it. When I did the last performance optimization, uima did not check the indexes yet and my test set did not contain markups.
>>>>>>>>>
>>>>>>>>> Deactivate creation of RutaBasic:
>>>>>>>>> Short answer is no. I was already thinking about making RutaBasic optional in future so that the user can configure if they are used. However, right now, they are required for rule inference and make the rule inference "fast" in the first place.
>>>>>>>>> RutaBasic is just an internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and rules should not match on them at all.
>>>>>>>>>
>>>>>>>>> Some background information:
>>>>>>>>>
>>>>>>>>> RutaBasics are used for three things:
>>>>>>>>> - store additional information in order to avoid index operations. Some useful conditions would require many index operations, e.g., PARTOF or ENDSWITH. RutaBasic is utilized as a cache of which annotations start and end at which position, and which positions are covered by which types.
>>>>>>>>> - provide a container to make this information available across analysis engines. Information shared by analysis engines is normally stored in the CAS, e.g. in annotations, (or in external resources). This is the role of RutaBasic. It is not really implemented right now as it should be but I will improve it soon. Then, there is no performance decrease when a pipeline is spammed with small ruta engines.
>>>>>>>>> - a basic minimal disjoint partitioning of the document for the coverage based visibility concept.
>>>>>>>>>
>>>>>>>>> Making RutaBasic optional is possible. If there is a real need for it, e.g., in order to reduce the memory footprint or when processing large documents where parts are simply not interesting, then I will put it on my TODO list. I am also open to other/new ideas on how to solve the challenges (and for incremental usage of internal caches).
>>>>>>>>>
>>>>>>>>> What is your experience with the processing overhead concerning RutaBasic? Is it the rule matching or rather the initialization? I myself already had some performance problems with the initialization and memory consumption in large CASes (PDFs of 500+ pages). However, other components, serialization and the CAS editor were the actual bottlenecks.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 22.12.2015 at 17:26, Mario Gazzo wrote:
>>>>>>>>>> I got around it by removing the default seeders by specifying an empty seeders list since we don’t need the MARKUP annotations anymore.
>>>>>>>>>>
>>>>>>>>>> I still don’t know why it created so much overhead but it sometimes seemed to rival the POS tagger in processing time.
>>>>>>>>>>
>>>>>>>>>> Anyway, this leads me to the next question. Can I disable the creation of Ruta basic annotations entirely to save processing overhead and only apply Ruta rules to other annotation types created by other AEs such as our own?
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> Mario
>>>>>>>>>>
>>>>>>>>>>> On 21 Dec 2015, at 16:09, Mario Juric wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>
>>>>>>>>>>> I noticed that occasionally the initialisation in RutaEngine::initializeStream can take a very long time. I can’t really explain this and it seems independent of document length since I have seen this with even very small XML documents.
>>>>>>>>>>>
>>>>>>>>>>> The method seems to spend much time in the DefaultSeeder when creating MARKUP annotations during subiterator.moveToNext calls (line 89) and inside Subiterator it seems to be the while loop inside adjustForStrictForward (line 232), which is inside UIMA core classes. I haven’t gone into any deeper analysis yet but I would first like to hear whether you have an idea what could be the main cause(s) for this?
>>>>>>>>>>>
>>>>>>>>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>> Mario
>
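
For readers of the archive, the caching role that the thread attributes to RutaBasic (remembering which types begin, end, and cover each position so that conditions such as PARTOF or ENDSWITH need no index lookups) can be illustrated with a minimal Java sketch. This is not Ruta's actual implementation: the class PositionCache, its methods, and the use of plain type-name strings are hypothetical names chosen only for illustration, and the coverage check is a simplification that works per character offset.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-offset cache; an illustration only, not Ruta's RutaBasic. */
final class PositionCache {
    // Types of annotations that begin or end at a given character offset.
    private final Map<Integer, Set<String>> beginsAt = new HashMap<>();
    private final Map<Integer, Set<String>> endsAt = new HashMap<>();
    // Types of annotations that cover a given character offset.
    private final Map<Integer, Set<String>> coveredBy = new HashMap<>();

    /** Records an annotation of the given type spanning [begin, end). */
    void add(String type, int begin, int end) {
        beginsAt.computeIfAbsent(begin, k -> new HashSet<>()).add(type);
        endsAt.computeIfAbsent(end, k -> new HashSet<>()).add(type);
        for (int i = begin; i < end; i++) {
            coveredBy.computeIfAbsent(i, k -> new HashSet<>()).add(type);
        }
    }

    /**
     * PARTOF-like check: true if every offset in [begin, end) is covered by
     * some annotation of the type. Simplification: adjacent annotations of
     * the same type are not distinguished from a single covering one.
     */
    boolean isPartOf(int begin, int end, String type) {
        for (int i = begin; i < end; i++) {
            Set<String> types = coveredBy.getOrDefault(i, Collections.emptySet());
            if (!types.contains(type)) {
                return false;
            }
        }
        return true;
    }

    /** STARTSWITH-like check: does an annotation of the type begin here? */
    boolean startsWith(int begin, String type) {
        return beginsAt.getOrDefault(begin, Collections.emptySet()).contains(type);
    }

    /** ENDSWITH-like check: does an annotation of the type end here? */
    boolean endsWith(int end, String type) {
        return endsAt.getOrDefault(end, Collections.emptySet()).contains(type);
    }
}
```

The trade-off in this sketch mirrors the one discussed in the thread: the cache is paid for once, up front, in time and memory proportional to the text covered, and in return each later condition check becomes a map lookup instead of an index traversal.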