Subject: Re: Very long Ruta
 stream initialization
To: user@uima.apache.org
From: Peter Klügl
Date: Mon, 4 Jan 2016 17:06:22 +0100
Message-ID: <568A987E.1050205@averbis.com>
In-Reply-To: <91BA95B1-DF4C-459F-BF59-7A21F2DC585B@gmail.com>

Hi,

On 04.01.2016 at 16:13, Mario Gazzo wrote:
> Hi Peter,
>
> No problem, I was pretty much offline myself during the Christmas holidays anyway.
>
> The term “overhead” is probably an exaggeration in this context, especially after I disabled the MARKUP initialisation.
> We earlier implemented our own XML markup annotator, tailored to our needs with additional annotation types and properties, so the Ruta MARKUP is currently not used. It just happens that we don’t directly use RutaBasic in any of our rules in this particular case, so I was curious whether we could avoid creating them in the first place, since there seem to be quite a few. However, the overall processing required by our Ruta scripts is now small compared to other processing steps, and sub-optimising this further by making RutaBasic optional would currently be of very low priority to us. We would prioritise other features higher, e.g. being able to assign annotations to variables, as we discussed previously in another thread.

I am working on this right now and there is finally some first progress :-)
I fear that I won't catch all use cases (combinations with language elements) with the first attempt. If you are interested (and want to make sure I do not miss your use case), feel free to take a look at the new unit tests:
https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation

It's still work in progress. Proposals for more unit tests are very welcome.

> We haven’t processed documents as large as those you mention, since books have so far been divided into chapters and processing could therefore be parallelised accordingly. We also drop extreme outliers above a certain size if we encounter them and batch process them later in smaller chunks, but this has so far not been necessary with our current data sets. Like you, our processing bottlenecks are now in different components.

Ah, it's nice to hear that Ruta is not the bottleneck :-D

Best,

Peter

> Cheers
> Mario
>
>> On 30 Dec 2015, at 16:44, Peter Klügl wrote:
>>
>> Hi,
>>
>> sorry for the delayed reply.
>>
>> RutaEngine::initializeStream:
>>
>> The special treatment of MARKUPs that causes the increased initialization time is just a workaround, because I was too lazy to write a working JFlex rule. Well, I tried but failed. It shouldn't be hard to improve this code... I will create an issue for it. When I did the last performance optimization, UIMA did not check the indexes yet and my test set did not contain markups.
>>
>> Deactivate creation of RutaBasic:
>> Short answer is no. I was already thinking about making RutaBasic optional in the future, so that users can configure whether they are used. However, right now they are required for rule inference and are what makes rule inference "fast" in the first place. RutaBasic is just an internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and rules should not match on them at all.
>>
>> Some background information:
>>
>> RutaBasics are used for three things:
>> - store additional information in order to avoid index operations. Some useful conditions would otherwise require many index operations, e.g., PARTOF or ENDSWITH. RutaBasic is used as a cache of which annotations start and end at which position, and which positions are covered by which types.
>> - provide a container to make this information available across analysis engines. Information shared by analysis engines is normally stored in the CAS, e.g. in annotations (or in external resources). This is the role of RutaBasic. It is not yet implemented as it should be, but I will improve it soon. Then there will be no performance decrease when a pipeline is spammed with small Ruta engines.
>> - a basic minimal disjoint partitioning of the document for the coverage-based visibility concept.
>>
>> Making RutaBasic optional is possible. If there is a real need for it, e.g., in order to reduce the memory footprint or when processing large documents where parts are simply not interesting, then I will put it on my TODO list.
>> I am also open to other/new ideas for solving these challenges (and for incremental usage of internal caches).
>>
>> What is your experience with the processing overhead concerning RutaBasic? Is it the rule matching or rather the initialization? I myself have already had some performance problems with initialization and memory consumption in large CASes (PDFs of 500+ pages). However, other components, serialization and the CAS editor were the actual bottlenecks.
>>
>> Best,
>>
>> Peter
>>
>>
>> On 22.12.2015 at 17:26, Mario Gazzo wrote:
>>> I got around it by removing the default seeders, specifying an empty seeders list, since we don’t need the MARKUP annotations anymore.
>>>
>>> I still don’t know why it created so much overhead, but it sometimes seemed to rival the POS tagger in processing time.
>>>
>>> Anyway, this leads me to the next question. Can I disable the creation of Ruta basic annotations entirely to save processing overhead and only apply Ruta rules to other annotation types created by other AEs such as our own?
>>>
>>> Cheers
>>> Mario
>>>
>>>> On 21 Dec 2015, at 16:09, Mario Juric wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> I noticed that occasionally the initialisation in RutaEngine::initializeStream can take a very long time. I can’t really explain it, and it seems independent of document length, since I have seen this even with very small XML documents.
>>>>
>>>> The method seems to spend much of its time in the DefaultSeeder when creating MARKUP annotations during subiterator.moveToNext calls (line 89), and inside Subiterator it seems to be the while loop inside adjustForStrictForward (line 232), which is inside the UIMA core classes. I haven’t done any deeper analysis yet, but I would first like to hear whether you have an idea what the main cause(s) could be.
>>>>
>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>
>>>>
>>>> Cheers
>>>> Mario
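
The 30 Dec 2015 message in this thread describes RutaBasic as a minimal disjoint partitioning of the document that doubles as a cache of which types begin at, end at, and cover each position. A rough, language-neutral sketch of that idea follows. It is not Ruta's actual Java implementation: the names Basic, build_basics and is_part_of are hypothetical, the partitioning here is derived only from annotation boundaries (Ruta also uses seeders), and the PARTOF-style check is deliberately simplified.

```python
from dataclasses import dataclass, field


@dataclass
class Basic:
    """One minimal segment of the document (hypothetical stand-in for RutaBasic)."""
    begin: int
    end: int
    begins: set = field(default_factory=set)   # types with an annotation starting here
    ends: set = field(default_factory=set)     # types with an annotation ending here
    part_of: set = field(default_factory=set)  # types covering this segment


def build_basics(doc_length, annotations):
    """Split [0, doc_length) at every annotation boundary and fill the caches.

    annotations: list of (begin, end, type_name) with 0 <= begin < end <= doc_length.
    """
    cuts = {0, doc_length}
    for begin, end, _ in annotations:
        cuts.update((begin, end))
    points = sorted(cuts)
    basics = [Basic(b, e) for b, e in zip(points, points[1:])]
    for begin, end, type_name in annotations:
        for basic in basics:
            # Segments never straddle a boundary, so containment is a full cover.
            if begin <= basic.begin and basic.end <= end:
                basic.part_of.add(type_name)
                if basic.begin == begin:
                    basic.begins.add(type_name)
                if basic.end == end:
                    basic.ends.add(type_name)
    return basics


def is_part_of(basics, begin, end, type_name):
    """Simplified PARTOF-style check answered from the cache, not the index."""
    covered = [b for b in basics if begin <= b.begin and b.end <= end]
    return bool(covered) and all(type_name in b.part_of for b in covered)


if __name__ == "__main__":
    anns = [(0, 10, "Sentence"), (0, 4, "Token"), (5, 10, "Token")]
    basics = build_basics(12, anns)
    for b in basics:
        print(b.begin, b.end, sorted(b.part_of))
    print(is_part_of(basics, 0, 4, "Sentence"))
```

The cost is paid once, when the segments are built, which matches the thread's observation that initialization rather than rule matching is where the time goes. After that, conditions in the style of PARTOF or ENDSWITH reduce to set lookups on the segments.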