incubator-ctakes-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Finan, Sean" <Sean.Fi...@childrens.harvard.edu>
Subject RE: assistance with dictionary lookup issue
Date Tue, 05 Feb 2013 16:14:38 GMT
Looks good to me. 

-----Original Message-----
From: Masanz, James J. [mailto:Masanz.James@mayo.edu] 
Sent: Tuesday, February 05, 2013 11:09 AM
To: 'ctakes-dev@incubator.apache.org'
Subject: RE: assistance with dictionary lookup issue

Yes, that's what I had in mind, with a future work item to consider doing the same improvement
to the other LookupInitializer(s) some day.

-- James

> -----Original Message-----
> From: 
> ctakes-dev-return-1148-Masanz.James=mayo.edu@incubator.apache.org
> [mailto:ctakes-dev-return-1148-Masanz.James=mayo.edu@incubator.apache.
> org]
> On Behalf Of Tim Miller
> Sent: Tuesday, February 05, 2013 10:05 AM
> To: ctakes-dev@incubator.apache.org
> Subject: Re: assistance with dictionary lookup issue
> 
> OK thanks, that clarifies it (including Sean's concern) for me.  I 
> think calling it getSortedLookupTokens is the right idea.  Then it 
> makes it clear to the implementing class that at some point sorting 
> has to be done, and if its not implicit/free in the implementation (as 
> it is if you base it on Annotations), then it needs to be explicit.  
> Since that can be a substantial performance penalty, putting the onus 
> on the implementation will hopefully lead to better implementations.
> 
> So to summarize the changes to make sure we're on the same page:
> 
>   * the interface will add a method:
> 
>              public List getSortedLookupTokens(JCas, Annotation);
> 
>   * getLookupTokenIterator() will be reverted to its old version.
> 
>   * FirstTokenPermLookupInitializerImpl will have its
>     getLookupTokenIterator method reverted, and my changes
> 
> will go in the implementation of getSortedLookupTokens()   modulo the
> unnecessary iterator inside.
> 
>   * DictionaryLookupAnnotator will be changed to call
>     getSortedLookupTokens instead of
>     getLookupTokenIterator/constrainToWindow
>   * Other LookupInitializer's will be implemented to simply call
>     existing getLookupTokenIterator and do post-processing to constrain
>     to the window (?)
> 
> Does this match what you had in mind James?  Any objections or things 
> I'm missing anyone?
> 
> 
> On 02/05/2013 10:38 AM, Masanz, James J. wrote:
> > First, about the loop - I had been looking too quickly at the diff 
> > and
> didn't notice the logic about punctuation etc
> >
> > Second, what I remember when I looked at it before, was seeing 
> > interface
> named LookupInitializer, which being old enough, doesn't have Iterator 
> parameterized in the definition of the getLookupTokenIterator:
> >      public Iterator getLookupTokenIterator(JCas jcas)
> >              throws AnnotatorInitializationException;
> >
> > and that ends up being effectively an Iterator<LookupToken> and
> LookupToken does not inherit from Annotation, and I stopped at that point.
> >
> > But now looking farther, it looks to me that that's fine because in
> FirstTokenPermLookupInitializerImpl, we look through the BaseTokens 
> and create the list of LookupTokens based on the (sorted) BaseTokens, 
> so the LookupTokens will be sorted too.
> >
> > so maybe we should call the new method getSortedLookupTokens to make 
> > it
> clear they too are sorted
> >
> > ________________________________________
> > From: 
> > ctakes-dev-return-1146-Masanz.James=mayo.edu@incubator.apache.org
> [ctakes-dev-return-1146-Masanz.James=mayo.edu@incubator.apache.org] on 
> behalf of Tim Miller [timothy.miller@childrens.harvard.edu]
> > Sent: Tuesday, February 05, 2013 9:10 AM
> > To: ctakes-dev@incubator.apache.org
> > Subject: Re: assistance with dictionary lookup issue
> >
> > Yeah, if you mean just change the loop to iterate over the list 
> > instead of getting an iterator that makes sense.  There is still 
> > some logic in there to leave out punctuation tokens but I think you 
> > were implying that to be in your mockup diff.
> >
> > As for sorting, the AnnotationIndex defines a sort order for its
> iterators:
> > http://uima.apache.org/d/uimaj-
> 2.4.0/apidocs/org/apache/uima/cas/text/AnnotationIndex.html
> >
> > so we are safe assuming that anything extending Annotation will be 
> > iterated in sorted order.  Does that answer the questions we had? Or 
> > was I missing something subtle in that discussion?
> >
> > Tim
> >
> > On 02/05/2013 09:44 AM, Masanz, James J. wrote:
> >> Looks good to me, with one question.
> >>
> >> Instead of getting an iterator and then building a new list, can we
> just skip getting the iterator and use the list that selectCovered 
> returns?
> >>
> >> I will mock up a diff here of what I mean:
> >> -     Iterator btaItr = org.uimafit.util.JCasUtil.selectCovered(jcas,
> BaseToken.class, covering).iterator();
> >> -     while (btaItr.hasNext())
> >> -             {
> >> -                     BaseToken bta = (BaseToken) btaItr.next();
> >> -                             ltList.add(lt);
> >> -                     }
> >> -             }
> >>
> >> +     ltList = org.uimafit.util.JCasUtil.selectCovered(jcas,
> BaseToken.class, covering);
> >>
> >>        return ltList;
> >>
> >> I know you said it was quick and dirty at the moment - my 2 cents -
> unless someone comes up with a better engineered solution, I think we 
> could add the new method (with a name like getLookupTokens) and leave 
> the old one so we don't have to deprecate anything. And phase in the 
> change to the various *LookupInitializerImpl classes if needed.
> >>
> >> -- James
> >>
> >>
> >>> -----Original Message-----
> >>> From: ctakes-dev-return-1138-
> Masanz.James=mayo.edu@incubator.apache.org
> >>> [mailto:ctakes-dev-return-1138-
> Masanz.James=mayo.edu@incubator.apache.org]
> >>> On Behalf Of Masanz, James J.
> >>> Sent: Monday, February 04, 2013 4:01 PM
> >>> To: ctakes-dev@incubator.apache.org
> >>> Subject: RE: assistance with dictionary lookup issue
> >>>
> >>> I'll take a look at the patch. Also be aware of
> >>> https://issues.apache.org/jira/browse/CTAKES-31 which talks about 
> >>> a
> way of
> >>> enhancing performance  -- if willing to assume annotations 
> >>> (BaseTokens
> >>> currently) are sorted. Currently it's always BaseToken and always
> sorted,
> >>> just not sure if we want to code to that assumption.
> >>>
> >>> ________________________________________
> >>> From: ctakes-dev-return-1137-
> Masanz.James=mayo.edu@incubator.apache.org
> >>> [ctakes-dev-return-1137-Masanz.James=mayo.edu@incubator.apache.org
> >>> ] on behalf of Tim Miller [timothy.miller@childrens.harvard.edu]
> >>> Sent: Monday, February 04, 2013 3:43 PM
> >>> To: ctakes-dev@incubator.apache.org
> >>> Subject: assistance with dictionary lookup issue
> >>>
> >>> Pei helped me track down an issue with performance I'd noticed in 
> >>> the dictionary annotator, and I have filed the issue here:
> >>> https://issues.apache.org/jira/browse/CTAKES-143
> >>>
> >>> I implemented a quick and dirty proof of concept fix and noticed
> dramatic
> >>> performance improvement.  I attached the patch to the issue, but 
> >>> it involves changing an interface (currently does not try to fix 
> >>> other implementing classes so obviously not ready for primetime), 
> >>> so I
> wanted to
> >>> solicit the list first in case anyone with better knowledge of 
> >>> that
> module
> >>> has some better engineering ideas than what I came up with.
> >>>
> >>> Thanks,
> >>>
> >>> --
> >>> Tim Miller, PhD
> >>> Postdoctoral Research Fellow
> >>> Children's Hospital Informatics Program Children's Hospital Boston 
> >>> and Harvard Medical School
> >>> 617-919-1223


Mime
View raw message