incubator-ctakes-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Masanz, James J." <Masanz.Ja...@mayo.edu>
Subject RE: assistance with dictionary lookup issue
Date Tue, 05 Feb 2013 15:38:58 GMT
First, about the loop - I had been looking too quickly at the diff and didn't notice the logic
about punctuation etc

Second, what I remember when I looked at it before, was seeing interface named LookupInitializer,
which being old enough, doesn't have Iterator parameterized in the definition of the getLookupTokenIterator:
    public Iterator getLookupTokenIterator(JCas jcas)
            throws AnnotatorInitializationException;

and that ends up being effectively an Iterator<LookupToken> and LookupToken does not
inherit from Annotation, and I stopped at that point. 

But now looking farther, it looks to me that that's fine because in FirstTokenPermLookupInitializerImpl,
we look through the BaseTokens and create the list of LookupTokens based on the (sorted) BaseTokens,
so the LookupTokens will be sorted too.

so maybe we should call the new method getSortedLookupTokens to make it clear they too are
sorted

________________________________________
From: ctakes-dev-return-1146-Masanz.James=mayo.edu@incubator.apache.org [ctakes-dev-return-1146-Masanz.James=mayo.edu@incubator.apache.org]
on behalf of Tim Miller [timothy.miller@childrens.harvard.edu]
Sent: Tuesday, February 05, 2013 9:10 AM
To: ctakes-dev@incubator.apache.org
Subject: Re: assistance with dictionary lookup issue

Yeah, if you mean just change the loop to iterate over the list instead
of getting an iterator that makes sense.  There is still some logic in
there to leave out punctuation tokens but I think you were implying that
to be in your mockup diff.

As for sorting, the AnnotationIndex defines a sort order for its iterators:
http://uima.apache.org/d/uimaj-2.4.0/apidocs/org/apache/uima/cas/text/AnnotationIndex.html

so we are safe assuming that anything extending Annotation will be
iterated in sorted order.  Does that answer the questions we had? Or was
I missing something subtle in that discussion?

Tim

On 02/05/2013 09:44 AM, Masanz, James J. wrote:
> Looks good to me, with one question.
>
> Instead of getting an iterator and then building a new list, can we just skip getting
the iterator and use the list that selectCovered returns?
>
> I will mock up a diff here of what I mean:
> -     Iterator btaItr = org.uimafit.util.JCasUtil.selectCovered(jcas, BaseToken.class,
covering).iterator();
> -     while (btaItr.hasNext())
> -             {
> -                     BaseToken bta = (BaseToken) btaItr.next();
> -                             ltList.add(lt);
> -                     }
> -             }
>
> +     ltList = org.uimafit.util.JCasUtil.selectCovered(jcas, BaseToken.class, covering);
>
>       return ltList;
>
> I know you said it was quick and dirty at the moment - my 2 cents - unless someone comes
up with a better engineered solution, I think we could add the new method (with a name like
getLookupTokens) and leave the old one so we don't have to deprecate anything. And phase in
the change to the various *LookupInitializerImpl classes if needed.
>
> -- James
>
>
>> -----Original Message-----
>> From: ctakes-dev-return-1138-Masanz.James=mayo.edu@incubator.apache.org
>> [mailto:ctakes-dev-return-1138-Masanz.James=mayo.edu@incubator.apache.org]
>> On Behalf Of Masanz, James J.
>> Sent: Monday, February 04, 2013 4:01 PM
>> To: ctakes-dev@incubator.apache.org
>> Subject: RE: assistance with dictionary lookup issue
>>
>> I'll take a look at the patch. Also be aware of
>> https://issues.apache.org/jira/browse/CTAKES-31 which talks about a way of
>> enhancing performance  -- if willing to assume annotations (BaseTokens
>> currently) are sorted. Currently it's always BaseToken and always sorted,
>> just not sure if we want to code to that assumption.
>>
>> ________________________________________
>> From: ctakes-dev-return-1137-Masanz.James=mayo.edu@incubator.apache.org
>> [ctakes-dev-return-1137-Masanz.James=mayo.edu@incubator.apache.org] on
>> behalf of Tim Miller [timothy.miller@childrens.harvard.edu]
>> Sent: Monday, February 04, 2013 3:43 PM
>> To: ctakes-dev@incubator.apache.org
>> Subject: assistance with dictionary lookup issue
>>
>> Pei helped me track down an issue with performance I'd noticed in the
>> dictionary annotator, and I have filed the issue here:
>> https://issues.apache.org/jira/browse/CTAKES-143
>>
>> I implemented a quick and dirty proof of concept fix and noticed dramatic
>> performance improvement.  I attached the patch to the issue, but it
>> involves changing an interface (currently does not try to fix other
>> implementing classes so obviously not ready for primetime), so I wanted to
>> solicit the list first in case anyone with better knowledge of that module
>> has some better engineering ideas than what I came up with.
>>
>> Thanks,
>>
>> --
>> Tim Miller, PhD
>> Postdoctoral Research Fellow
>> Children's Hospital Informatics Program
>> Children's Hospital Boston and Harvard Medical School
>> 617-919-1223
Mime
View raw message