uima-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Richard Eckart de Castilho <...@apache.org>
Subject Re: Performance of UIMAfit JCasUtil.selectCovered() and variants
Date Wed, 21 Oct 2015 16:26:28 GMT
Hi,

1 uses the UIMA indexes which I believe use a binary search, so it should
be something like O(log n).

2 is in principle O(n) but since it does a linear scan from the beginning
and stops when no further annotations may be found, it practice O(n) 
should be the upper bound when called for annotations towards the end of
the document.

3 is fastest for repeated use. It should be O(n) for creating
the index and then uses hashmap lookups.

So 1 and 3 are better than two.

If you need speed and need coverage information a lot, 3 should be the best.

1 and 2 are more convenient for coding.

If you use plain UIMA and have type priorities set up, then using an
iterator over sentences and a subiterator over tokens is likely to
be better than 3 because it doesn't need the initial scan that 3 does.

I'm not aware that anybody did extensive performance comparisons here.
Some are being done in org.apache.uima.fit.util.JCasUtilTest.testSelectCoverRandom()
which compares 1 and 2. Here a few lines from the test output (mind to increase the
ITERATIONS variable if you try):

...
Speed up factor 5.50 [naive:11 optimized:2 diff:9]
Speed up factor 6.67 [naive:20 optimized:3 diff:17]
Speed up factor 4.00 [naive:16 optimized:4 diff:12]
Speed up factor 2.50 [naive:30 optimized:12 diff:18]
Speed up factor 7.00 [naive:35 optimized:5 diff:30]
Speed up factor 5.63 [naive:45 optimized:8 diff:37]
Speed up factor 7.78 [naive:70 optimized:9 diff:61]
Speed up factor 8.09 [naive:89 optimized:11 diff:78]
...

Cheers,

-- Richard

> On 21.10.2015, at 17:07, Erik Fäßler <erik.faessler@uni-jena.de> wrote:
> 
> Hi all,
> 
> I’m wondering about the performance differences between
> 
> 1) JCasUtil.selectCovered(JCas, Class<T>, AnnotationFS),
> 2) JCasUtil.selectCovered(JCas, Class<T>, int, int) and
> 3) JCasUtil.indexCovered(JCas, Class<T>, Class<S>)
> 
> It is clear that 3) iterates once through the CAS and just returns a map. Once this is
done, map access is swift.
> 
> The Javadoc of 2) states that it is slower than 1).
> 3) states that it is preferable to 2).
> 
> Questions:
> Is 3) also preferable over 2) when there is only one covering annotation or is the performance
of 2) and 3) roughly equal then?
> Main question: Is 3) also quicker than 1) if there are many covering annotations?
> 
> Use case: I want to iterate through all sentences in paragraphs. Normally, I would use
subiterators(), but the known type priority issue could be a problem for me. Should I just
use 1)? Or would I still benefit from 3) if I have more than one paragraph?
> 
> Thank you very much!
> 
> Best,
> 
> Erik


Mime
View raw message