ctakes-dev mailing list archives

From "Masanz, James J." <Masanz.Ja...@mayo.edu>
Subject RE: Dictionary Lookup algorithm
Date Thu, 11 Apr 2013 20:03:47 GMT

This doesn't have the detail you want, but if you haven't seen it already, you might still
want to start with the following page, and then re-read it after you read this post.

In particular note the mentions of the LookupDescriptorFile.

That page doesn't have details of the various classes such as FirstTokenPermutationImpl.

I believe we don't have anything better than the javadocs for FirstTokenPermutationImpl.
Pei generated a preview of the latest javadocs under staging:

I can give a sketch of what I know about the lookup algorithms, using the example of FirstTokenPermutationImpl:
Suppose you are using AggregatePlaintextUMLSProcessor.xml in ctakes-clinical-pipeline.
After all the noun phrases are found, the LookupWindowAnnotator is used to create a LookupWindowAnnotation
for each noun phrase.
Any overlapping LookupWindowAnnotations are merged (by MaxLookupWindows annotator).

Then for each LookupWindowAnnotation, the following is done:
 - for each token, look up the token in the 
   "first token" field of the dictionary.
 - if the token is found at least once, collect all dictionary
   entries that start with that token
 - for each dictionary entry:
    -- within the current LookupWindowAnnotation, and no more than
       n tokens to the right of the current token, try to 
       find all the other tokens from the dictionary entry.
       If they are all found, add the dictionary entry to 
       the list of hits.
       A token is considered "found" if there is either
       an exact match or a match to the normalized form 
       of the word (as configured in LookupDesc*xml).
 - Then something like NamedEntityLookupConsumerImpl 
   is used to create the actual annotation within the CAS.
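
The steps above can be sketched roughly as follows. This is a minimal Python illustration
written for this explanation, not the cTAKES code: the names build_first_token_index, lookup
and normalize are hypothetical, tokens are assumed to be lower-cased strings split on
whitespace, and repeated words in a dictionary entry are not counted separately.

```python
from collections import defaultdict


def build_first_token_index(entries):
    """Map each entry's first token to the token lists of entries starting with it."""
    index = defaultdict(list)
    for entry in entries:
        tokens = entry.lower().split()  # assumption: whitespace tokenization
        if tokens:
            index[tokens[0]].append(tokens)
    return index


def lookup(window_tokens, index, n, normalize=lambda t: t):
    """Return (start_position, entry) hits for the tokens of one lookup window."""
    hits = []
    for i, token in enumerate(window_tokens):
        # Look the token (or its normalized form) up in the "first token" index.
        candidates = index.get(token) or index.get(normalize(token)) or []
        # Only tokens up to n positions to the right, still inside the window.
        right = window_tokens[i + 1 : i + 1 + n]
        available = set(right) | {normalize(t) for t in right}
        for entry in candidates:
            # Every remaining token of the entry must appear to the right.
            if all(t in available for t in entry[1:]):
                hits.append((i, " ".join(entry)))
    return hits


index = build_first_token_index(["chest pain", "pain", "heart attack"])
print(lookup("chest wall pain".split(), index, n=4))
# [(0, 'chest pain'), (2, 'pain')]
print(lookup("severe pain in the left chest".split(), index, n=4))
# [(1, 'pain')]
```

In the second call "chest pain" is not reported, because "pain" appears to the left of
"chest" and the search only looks to the right of the first token.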

Since comparisons are done one token at a time, it is important that the dictionary be tokenized
the same way that the text is being tokenized.
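
A small illustration of why this matters, as a sketch only: the two tokenizers below are
hypothetical stand-ins (neither is the cTAKES tokenizer), chosen so that one keeps "x-ray"
as a single token and the other splits it at the hyphen.

```python
import re


def whitespace_tokens(text):
    """Hypothetical tokenizer: split on whitespace only."""
    return text.lower().split()


def punctuation_aware_tokens(text):
    """Hypothetical tokenizer: punctuation such as '-' becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def first_token_found(entry_tokens, text_tokens):
    """True if the entry's first token occurs anywhere in the text tokens."""
    return entry_tokens[0] in text_tokens


text_tokens = punctuation_aware_tokens("Chest x-ray ordered")
print(text_tokens)
# ['chest', 'x', '-', 'ray', 'ordered']

# Dictionary tokenized differently from the text: first token "x-ray" is never seen.
print(first_token_found(whitespace_tokens("x-ray"), text_tokens))
# False

# Dictionary tokenized the same way as the text: first token "x" is found.
print(first_token_found(punctuation_aware_tokens("x-ray"), text_tokens))
# True
```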
Since FirstTokenPermutationImpl looks out n tokens, a dictionary entry of x tokens can match
when all of its words are found within a single LookupWindowAnnotation, where x < n and x < length
of LookupWindowAnnotation; intervening words are allowed and ignored. Word order is also
ignored, except that the first word must be to the left of all the other words (since the
FirstTokenPermutationImpl algorithm looks only to the right of the current token).

The above is mostly taken from memory, and I've glossed over a number of details. Hopefully
this at least gives an overview.

-- James

> -----Original Message-----
> From: dev-return-1493-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-
> return-1493-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of shady
> hussein
> Sent: Thursday, April 11, 2013 9:11 AM
> To: ctakes-dev@incubator.apache.org
> Subject: Dictionary Lookup algorithm
> Dear All,
>   Is there a documentation somewhere, about how the dictionary lookup
> method works exactly ?. Of course i can check the code of
> "DirectPassThroughImpl" and "FirstTokenPermutationImpl", but i find it
> waste of time, if there is a documentation somewhere. Also i would like to
> understand how the lookupwindow annotation works. If there is some guide
> to these things. I would be very grateful
> Thanks,
> 	Shady
