uima-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kline, Larry" <Larry.Kl...@mckesson.com>
Subject RE: FilteredIterator is very slow
Date Tue, 01 Apr 2014 20:01:49 GMT
Thomas,

Thanks for the suggestions.  Before I read your email I happened on this method, FSUtil.getAnnotationsIteratorInSpan(aJCas,
annotationType, begin, end).  I pass it the typeId of the type I want to check and the begin
and end of the dictTerm's span.  I need to run some tests to make sure it is doing what I
assume it is, but if it is correct, it has much better performance than my previous code.
 Perhaps this is using the method you suggest.  Yes, my dictTerms are much more numerous than
the filter types but currently iterate over them one at a time in the caller, ask if any of
the filter types overlaps, and if so discard it.  If it's not discarded I go on to do other
things with it.  I could probably preprocess the dictTerms first by checking them for overlap
against the filter types and remove the matches.  Why do you think this would be faster? 
Isn't it MxN versus NxM?

Thanks again,
Larry

-----Original Message-----
From: Thomas Ginter [mailto:thomas.ginter@utah.edu] 
Sent: Monday, March 31, 2014 12:56 PM
To: user@uima.apache.org
Subject: Re: FilteredIterator is very slow

Larry,

A faster way to get the list of types that you will skip would be to do the following:

FSIndex<TitlePersonHonorificAnnotation> titlePersonHAIndex = aJCas.getAnnotationIndex(TitlePersonHonorificAnnotation.type);

Doing this for each type will yield an index that points to just the annotations in the CAS
of each type you are interested in.  From there you can get an iterator reference ( titlePersonHAIndex.iterator()
) and either traverse each one separately or else add them to a common Collection such as
an ArrayList and iterate through that.  You could also take advantage of the fact that the
default index in UIMA sorts on ascending order on the begin index and descending order on
the ending index to stop once you have traversed the list past the ending index of the dictTerm.
 

An important design decision though would be to consider whether the dictTerm annotations
are much more numerous than the TitlePersonHonorificAnnotation, MeasurementAnnotation, and
ProgFactorTerm filtering annotation types.  Generally if the filter types are much more plentiful
and the dictTerm type was more rare then looking for overlapping filter types will yield fewer
iterations of your algorithm, however if there are a lot of dictTerm occurrences and only
a few of the filter types then it may be more efficient to iterate through the filter types
and eliminate dictTerms that overlap or are covered.  

Thanks,

Thomas Ginter
801-448-7676
thomas.ginter@utah.edu




On Mar 31, 2014, at 11:47 AM, Kline, Larry <Larry.Kline@mckesson.com> wrote:

> When I use a filtered FSIterator it's an order of magnitude slower than a non-filtered
iterator.  Here's my code:
> 
> Create the iterator:
>       private FSIterator<Annotation> createConstrainedIterator(JCas aJCas) throws
CASException {
>              FSIterator<Annotation> it = aJCas.getAnnotationIndex().iterator();
>              FSTypeConstraint constraint = aJCas.getConstraintFactory().createTypeConstraint();
>              constraint.add((new TitlePersonHonorificAnnotation(aJCas)).getType());
>              constraint.add((new MeasurementAnnotation(aJCas)).getType());
>              constraint.add((new ProgFactorTerm(aJCas)).getType());
>              it = aJCas.createFilteredIterator(it, constraint);
>              return it;
>       }
> Use the iterator:
>       public void process(JCas aJCas) throws AnalysisEngineProcessException {
>              ...
> // The following is done in a loop
>                           if (shouldSkip(dictTerm, skipIter))
>                                  continue;
>              ...
>       }
> Here's the method called:
>       private boolean shouldSkip(G2DictTerm dictTerm, FSIterator<Annotation> skipIter)
throws CASException {
>              boolean shouldSkip = false;
>              skipIter.moveToFirst();
>              while (skipIter.hasNext()) {
>                     Annotation annotation = skipIter.next();
>                     if (UIMAUtils.annotationsOverlap(dictTerm, annotation)) {
>                           shouldSkip = true;
>                           break;
>                     }
>              }
>              return shouldSkip;
>       }
> 
> If I change the method, createConstrainedIterator(), to this (that is, no constraints):
>       private FSIterator<Annotation> createConstrainedIterator(JCas aJCas) throws
CASException {
>              FSIterator<Annotation> it = aJCas.getAnnotationIndex().iterator();
>              return it;
>       }
> 
> It runs literally 10 times faster.  Doing some profiling I see that all of the time is
spent in the skipIter.moveToFirst() call.  I also tried creating the filtered iterator each
time anew in the shouldSkip() method instead of passing it in, but that has even slightly
worse performance.
> 
> Given this performance I suppose I should probably use a non-filtered iterator and just
check for the types I'm interested in inside the loop.
> 
> Any other suggestions welcome.
> 
> Thanks,
> Larry Kline
> 
> 


Mime
View raw message