uima-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Fäßler <erik.faess...@uni-jena.de>
Subject Re: Performance of UIMAfit JCasUtil.selectCovered() and variants
Date Thu, 22 Oct 2015 06:44:47 GMT
Very nice, thank you so much :-)

> On 21 Oct 2015, at 18:26, Richard Eckart de Castilho <rec@apache.org> wrote:
> 
> Hi,
> 
> 1 uses the UIMA indexes which I believe use a binary search, so it should
> be something like O(log n).
> 
> 2 is in principle O(n) but since it does a linear scan from the beginning
> and stops when no further annotations may be found, it practice O(n) 
> should be the upper bound when called for annotations towards the end of
> the document.
> 
> 3 is fastest for repeated use. It should be O(n) for creating
> the index and then uses hashmap lookups.
> 
> So 1 and 3 are better than two.
> 
> If you need speed and need coverage information a lot, 3 should be the best.
> 
> 1 and 2 are more convenient for coding.
> 
> If you use plain UIMA and have type priorities set up, then using an
> iterator over sentences and a subiterator over tokens is likely to
> be better than 3 because it doesn't need the initial scan that 3 does.
> 
> I'm not aware that anybody did extensive performance comparisons here.
> Some are being done in org.apache.uima.fit.util.JCasUtilTest.testSelectCoverRandom()
> which compares 1 and 2. Here a few lines from the test output (mind to increase the
> ITERATIONS variable if you try):
> 
> ...
> Speed up factor 5.50 [naive:11 optimized:2 diff:9]
> Speed up factor 6.67 [naive:20 optimized:3 diff:17]
> Speed up factor 4.00 [naive:16 optimized:4 diff:12]
> Speed up factor 2.50 [naive:30 optimized:12 diff:18]
> Speed up factor 7.00 [naive:35 optimized:5 diff:30]
> Speed up factor 5.63 [naive:45 optimized:8 diff:37]
> Speed up factor 7.78 [naive:70 optimized:9 diff:61]
> Speed up factor 8.09 [naive:89 optimized:11 diff:78]
> ...
> 
> Cheers,
> 
> -- Richard
> 
>> On 21.10.2015, at 17:07, Erik Fäßler <erik.faessler@uni-jena.de> wrote:
>> 
>> Hi all,
>> 
>> I’m wondering about the performance differences between
>> 
>> 1) JCasUtil.selectCovered(JCas, Class<T>, AnnotationFS),
>> 2) JCasUtil.selectCovered(JCas, Class<T>, int, int) and
>> 3) JCasUtil.indexCovered(JCas, Class<T>, Class<S>)
>> 
>> It is clear that 3) iterates once through the CAS and just returns a map. Once this
is done, map access is swift.
>> 
>> The Javadoc of 2) states that it is slower than 1).
>> 3) states that it is preferable to 2).
>> 
>> Questions:
>> Is 3) also preferable over 2) when there is only one covering annotation or is the
performance of 2) and 3) roughly equal then?
>> Main question: Is 3) also quicker than 1) if there are many covering annotations?
>> 
>> Use case: I want to iterate through all sentences in paragraphs. Normally, I would
use subiterators(), but the known type priority issue could be a problem for me. Should I
just use 1)? Or would I still benefit from 3) if I have more than one paragraph?
>> 
>> Thank you very much!
>> 
>> Best,
>> 
>> Erik
> 


Mime
View raw message