From: Erik Fäßler
To: user@uima.apache.org
Subject: Re: Performance of UIMAfit JCasUtil.selectCovered() and variants
Date: Thu, 22 Oct 2015 11:05:01 +0200

Small follow-up: I stumbled across this by chance, without even searching for it:

http://searchivarius.org/blog/selectcovered-substantially-better-version-uima-subiterator

Someone has done a performance comparison. It does not strike me as the most sophisticated approach, but the code is available and could serve as a hint.
Best,

Erik

> On 21 Oct 2015, at 18:26, Richard Eckart de Castilho wrote:
>
> Hi,
>
> 1 uses the UIMA indexes, which I believe use a binary search, so it should
> be something like O(log n).
>
> 2 is in principle O(n), but since it does a linear scan from the beginning
> and stops when no further annotations may be found, in practice O(n)
> should be the upper bound, reached when it is called for annotations
> towards the end of the document.
>
> 3 is fastest for repeated use. It should be O(n) for creating
> the index and then uses hash map lookups.
>
> So 1 and 3 are better than 2.
>
> If you need speed and need coverage information a lot, 3 should be the best.
>
> 1 and 2 are more convenient for coding.
>
> If you use plain UIMA and have type priorities set up, then using an
> iterator over sentences and a subiterator over tokens is likely to
> be better than 3 because it doesn't need the initial scan that 3 does.
>
> I'm not aware that anybody did extensive performance comparisons here.
> Some are being done in org.apache.uima.fit.util.JCasUtilTest.testSelectCoverRandom(),
> which compares 1 and 2. Here are a few lines from the test output (remember to
> increase the ITERATIONS variable if you try):
>
> ...
> Speed up factor 5.50 [naive:11 optimized:2 diff:9]
> Speed up factor 6.67 [naive:20 optimized:3 diff:17]
> Speed up factor 4.00 [naive:16 optimized:4 diff:12]
> Speed up factor 2.50 [naive:30 optimized:12 diff:18]
> Speed up factor 7.00 [naive:35 optimized:5 diff:30]
> Speed up factor 5.63 [naive:45 optimized:8 diff:37]
> Speed up factor 7.78 [naive:70 optimized:9 diff:61]
> Speed up factor 8.09 [naive:89 optimized:11 diff:78]
> ...
>
> Cheers,
>
> -- Richard
>
>> On 21.10.2015, at 17:07, Erik Fäßler wrote:
>>
>> Hi all,
>>
>> I’m wondering about the performance differences between
>>
>> 1) JCasUtil.selectCovered(JCas, Class, AnnotationFS),
>> 2) JCasUtil.selectCovered(JCas, Class, int, int) and
>> 3) JCasUtil.indexCovered(JCas, Class, Class)
>>
>> It is clear that 3) iterates once through the CAS and just returns a map. Once this is done, map access is swift.
>>
>> The Javadoc of 2) states that it is slower than 1).
>> 3) states that it is preferable to 2).
>>
>> Questions:
>> Is 3) also preferable over 2) when there is only one covering annotation or is the performance of 2) and 3) roughly equal then?
>> Main question: Is 3) also quicker than 1) if there are many covering annotations?
>>
>> Use case: I want to iterate through all sentences in paragraphs. Normally, I would use subiterators(), but the known type priority issue could be a problem for me. Should I just use 1)? Or would I still benefit from 3) if I have more than one paragraph?
>>
>> Thank you very much!
>>
>> Best,
>>
>> Erik
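
The trade-off between the three variants discussed in this thread can be sketched without UIMA at all. What follows is a minimal plain-Java model, not uimaFIT's implementation: Span, linearScan, binarySearch and buildIndex are hypothetical names introduced only for illustration, and the sketch assumes the covered spans arrive sorted by begin offset, roughly as an annotation index would supply them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the three lookup strategies. NOT uimaFIT code:
// every name here is hypothetical and exists only for illustration.
final class CoverLookupSketch {

    // Stand-in for an annotation: just a pair of character offsets.
    static final class Span {
        final int begin;
        final int end;
        Span(int begin, int end) { this.begin = begin; this.end = end; }
        @Override public String toString() { return "[" + begin + "," + end + "]"; }
    }

    // Models 2): scan from the start of the list and stop as soon as
    // a span begins after the window, since no later span can be covered.
    static List<Span> linearScan(List<Span> sortedByBegin, int begin, int end) {
        List<Span> covered = new ArrayList<>();
        for (Span s : sortedByBegin) {
            if (s.begin > end) break;
            if (s.begin >= begin && s.end <= end) covered.add(s);
        }
        return covered;
    }

    // Models 1): binary search for the first span starting at or after
    // cover.begin, then walk forward only through the candidates.
    static List<Span> binarySearch(List<Span> sortedByBegin, Span cover) {
        int lo = 0;
        int hi = sortedByBegin.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedByBegin.get(mid).begin < cover.begin) lo = mid + 1; else hi = mid;
        }
        List<Span> covered = new ArrayList<>();
        for (int i = lo; i < sortedByBegin.size() && sortedByBegin.get(i).begin <= cover.end; i++) {
            Span s = sortedByBegin.get(i);
            if (s.end <= cover.end) covered.add(s);
        }
        return covered;
    }

    // Models 3): pay the setup cost once, then answer every query with a map lookup.
    // Keys are compared by identity here because Span does not override equals().
    static Map<Span, List<Span>> buildIndex(List<Span> covers, List<Span> sortedByBegin) {
        Map<Span, List<Span>> index = new HashMap<>();
        for (Span cover : covers) index.put(cover, binarySearch(sortedByBegin, cover));
        return index;
    }

    public static void main(String[] args) {
        List<Span> paragraphs = Arrays.asList(new Span(0, 50), new Span(51, 100));
        List<Span> sentences = Arrays.asList(
                new Span(0, 20), new Span(21, 50), new Span(51, 80), new Span(81, 100));
        Map<Span, List<Span>> index = buildIndex(paragraphs, sentences);
        for (Span p : paragraphs) {
            System.out.println(p + " scan=" + linearScan(sentences, p.begin, p.end)
                    + " search=" + binarySearch(sentences, p) + " index=" + index.get(p));
        }
    }
}
```

In this model the linear scan for the second paragraph still walks past every sentence of the first one, whereas the binary search jumps straight to the first candidate. The map only pays off when the same covering spans are queried repeatedly, which matches the "fastest for repeated use" remark above.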