ctakes-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From seanfi...@apache.org
Subject svn commit: r1574227 - in /ctakes/sandbox/ctakes-dictionary-lookup2/doc: DictionaryLookupHelp.docx DictionaryLookupHelp.txt DictionaryLookupStats.docx
Date Tue, 04 Mar 2014 22:30:50 GMT
Author: seanfinan
Date: Tue Mar  4 22:30:49 2014
New Revision: 1574227

URL: http://svn.apache.org/r1574227
Log:
First commit of documentation.  No Jira

Added:
    ctakes/sandbox/ctakes-dictionary-lookup2/doc/DictionaryLookupHelp.docx   (with props)
    ctakes/sandbox/ctakes-dictionary-lookup2/doc/DictionaryLookupHelp.txt
    ctakes/sandbox/ctakes-dictionary-lookup2/doc/DictionaryLookupStats.docx   (with props)

Added: ctakes/sandbox/ctakes-dictionary-lookup2/doc/DictionaryLookupHelp.docx
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-dictionary-lookup2/doc/DictionaryLookupHelp.docx?rev=1574227&view=auto
==============================================================================
Binary file - no diff available.

Propchange: ctakes/sandbox/ctakes-dictionary-lookup2/doc/DictionaryLookupHelp.docx
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: ctakes/sandbox/ctakes-dictionary-lookup2/doc/DictionaryLookupHelp.txt
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-dictionary-lookup2/doc/DictionaryLookupHelp.txt?rev=1574227&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-dictionary-lookup2/doc/DictionaryLookupHelp.txt (added)
+++ ctakes/sandbox/ctakes-dictionary-lookup2/doc/DictionaryLookupHelp.txt Tue Mar  4 22:30:49
2014
@@ -0,0 +1,311 @@
+cTakes Dictionary Lookup Module 2 Help
+
+3-04-2014
+
+
+
+
+Lookup Workflow
+
+No Lookup Initializer
+Unlike the current dictionary module, the new one has no lookup initializer that sits apart
from the lookup annotator.  All functionality of the current module’s Initializers resides
in the new module’s Annotator.  This removes some of the “pluggability” that exists for
the current module, but greatly reduces complexity and overhead.
+Lookup per Registered Dictionary per Lookup Window
+For each proper lookup token encountered in the lookup window, the new dictionary lookup
code runs through a loop that performs lookup per each registered Dictionary, finding terms
based upon the lookup token per dictionary.  All Lookups are performed in the same manner,
just using different term repositories (dictionaries).  
+Term Match
+Given the full text for all candidate terms in a dictionary, a match is attempted using that
text on the full text of the lookup window.  
+Single Lookup Consumer
+After all terms have been appropriated from all dictionaries, a single Consumer takes the
collection of terms and performs some action(s) upon them, such as storing them in the Cas.
 A Consumer (or delegates) can be created to differently handle terms with different Cuis,
Tuis, words, entity types, schemes, etc. if that is necessary, but there is one consumer entry
point, called once with the entire collection of discovered terms from all dictionaries.
+
+
+
+
+
+
+
+
+Lookup Methods
+
+Lookup by Rare Word
+Each dictionary entry, UMLS or other, contains the rare word used for the lookup key, a count
of words (tokens) within the term, and an index of the rare word within the term.  For each
lookup token (word of proper part of speech and span length) in the note, the lookup code
queries the dictionary for terms with a matching rare word.   For each returned term, a match
is attempted using the term and the appropriate token coverage in the note based upon the
offset indicated by the rare word index and length of the term.  This creates a forward and
backward matching method instead of the forward-only matching method of the current dictionary
(first word) lookup.  Using a term’s rare word as an index instead of a first word decreases,
on average, the number of terms for which matching must be performed.
+Text Exact Match
+Because the UMLS dictionary contains rows with different combinations of lexical elements
per term, using a direct string match of text in note to text of term is a valid candidate
for term matching.  This is different from the complex mechanism in the current (first word)
lookup, and makes for simpler code and greater accuracy.  This precise specification (and
improved lookup speed) enables the use of an entire sentence as a lookup window rather than
just a noun phrase.  Usage of Sentence as a lookup window allows all possible tokens to be
used for not only lookup keys, but also for term matching.  For proper accuracy, custom dictionaries
should also contain multiple entries for variations of term syntax.  Note that term matching
is attempted using the actual text in the note and also per-token cTakes-generated lexical
variants of the text in the note.
+Text Overlap Match
+To better approximate the current (first word) lookup annotator, one lookup method finds
overlapping terms in addition to exact matching terms.  This allows matches on discontiguous
spans.  For instance, for the text “blood, urine test” the exact match will find only one
procedure: “urine test”.  The overlap match will find both “urine test” and “blood  
  test”.
+
+Persistence Methods (Consumer)
+
+All Terms Persistence
+All terms discovered by the matchers can be stored in the CAS by a consumer, regardless of
any property of the term.  This means that for the text “lung cancer” the specific disease
term “lung cancer” and broader term “cancer”.  This can be useful for future searches
on general concepts, e.g. searching via the CUI for “cancer” and getting all instances of
“cancer” found in texts “lung cancer”, “skin cancer”, “stomach cancer”, etc.  
+Most Precise Term Persistence
+Matched terms can be stored only by the longest overlapping span discovered for a semantic
group.  This keeps, for instance, the disease “lung cancer” but not “cancer”.  Using semantic
groups means that both the disease “lung cancer” and the anatomical site “lung” are persisted
even though the spans overlap.  When using the overlap matching method, any discontiguous
spans are accounted for.  So, for “blood, urine test” both the discontiguous spanned term
“blood   test” and the contiguous spanned term “urine test” are valid.
+
+
+Dictionary Indexing
+
+Lookup by Rare Word
+No matter what text Match algorithm is used, it must run for every Term’s text returned
from the dictionary lookup of a Lookup Token.  To minimize the number of Terms returned from
the dictionary lookup, a Dictionary should be indexed by a Rare Word for each Term.  The Rare
Word would be a token of proper Lookup Token Part of Speech in the Term Text which has the
minimum number of instances in the entire Dictionary.  Some examples (‘-‘ denotes a non-word):
+
+Term Text
Word Counts in Dictionary Texts
Rare Word in Index
Rare Word Count in Index
Medical care vascular complications
11  3  1  1
vascular
1
Medical / dental care
11  -  2  3
dental
1
Medical / dental care treatments and procedures
11  -  2  3  1  -  1
treatments
1
Medical imaging
11  2
imaging
1
Medical imaging , ultrasound
11  2  -  1
ultrasound
1
Medical identification band on
11  3  1  -
band
1
Medical identification bracelet on
11  1  2  -
bracelet
1
Medical identification bracelet present
11  3  2  1
present
1
Medical procedure on the nervous system
11  3  -  -  3  3
procedure
1
Medical procedure on the central nervous system
11  3  -  -  1  3  3
central
1
Medical procedure on the peripheral nervous system
11  3  -  -  1  3  3
peripheral
1For any Lookup Window and the given Dictionary, any Lookup Token “Medical” would return
zero candidate terms and require no Match processing.  Only when a Lookup Token exists as
a Rare Word in a Term will it exist in the Dictionary index and return one or more candidate
Terms upon which to attempt a Match.  Sometimes this will be the first Token in a Term, but
often it is not.  Because a Lookup Token can be in the middle of a Term’s text, the Match
must be attempted by checking words both after and before the Lookup Token in the Lookup Window
text.
+
+
+Example Annotator Descriptor XML
+
+The following Descriptor XML is in the module’s example directory and can be used to perform
lookup using a Bar-Separated Value flat file as the dictionary.
+
+<taeDescription xmlns="http://uima.apache.org/resourceSpecifier">
+<frameworkImplementation>org.apache.uima.java</frameworkImplementation>
+<primitive>true</primitive>
+<annotatorImplementationName>
+	org.apache.ctakes.dictionary.lookup2.ae.DefaultJCasTermAnnotator
+</annotatorImplementationName>
+
+The above specifies that the default lookup annotator implementation should be used.  The
default lookup only uses contiguous text spans.  To specify usage of the lookup annotator
that allows terms on discontiguous spans, use:
+
+<annotatorImplementationName>
+org.apache.ctakes.dictionary.lookup2.ae.OverlapJCasTermAnnotator
+</annotatorImplementationName>
+
+In the previous Dictionary Lookup Module, the type of lookup window and the excluded parts
of speech were not specified in the analysis engine descriptor, they were specified in a secondary
lookup specification file.  For the new module this information is in the analysis engine
descriptor.
+
+<analysisEngineMetaData>
+<!-- Simple descriptor that uses DefaultJCasTermAnnotator and Bsv Dictionary File -->
+     <name>CustomLookupAnnotatorBSV</name>
+     <description>
+Dictionary Lookup Annotator descriptor for dictionaries which actually lie in a BSV file
+     </description>
+  <version/>
+     <vendor/>
+     <configurationParameters>
+<!-- windowAnnotations and exclusionTags were originally for the LookupConsumer, but now
apply to the annotator -->
+<configurationParameter>
+<name>windowAnnotations</name>
+<description>Type of window to use for lookup</description>
+  <type>String</type>
+   <multiValued>false</multiValued>
+     <mandatory>true</mandatory>
+</configurationParameter>
+     <configurationParameter>
+  <name>exclusionTags</name>
+<description>
+Parts of speech to ignore when considering lookup tokens
+     </description>
+<type>String</type>
+<multiValued>false</multiValued>
+<mandatory>false</mandatory>
+</configurationParameter>
+         <configurationParameter>
+     <name>minimumSpan</name>
+     <description>
+     Minimum required span length of tokens to use for lookup.  Default is 3
+     </description>
+     <type>Integer</type>
+     <multiValued>false</multiValued>
+     <mandatory>false</mandatory>
+     </configurationParameter>
+     </configurationParameters>
+
+<configurationParameterSettings>
+<nameValuePair>
+<name>windowAnnotations</name>
+<value>
+<!--  LookupWindowAnnotation is supposed to be a refined Noun Phrase  -->
+     <string>
+     org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation
+</string>
+</value>
+     </nameValuePair>
+<nameValuePair>
+<name>exclusionTags</name>
+<value>
+     <string>
+     VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,IN,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,
+     RP,TO,WDT,WP,WPS,WRB
+</string>
+</value>
+</nameValuePair>
+     <nameValuePair>
+     <name>minimumSpan</name>
+     <value>
+     <integer>3</integer>
+     </value>
+     </nameValuePair>
+</configurationParameterSettings>
+
+The above specifies that the lookup window annotation should be a LookupWindowAnnotation,
which is supposed to represent noun phrases in the text.  A larger, more inclusive lookup
window would be a sentence.  To specify usage of a sentence as the lookup window, use:
+
+<nameValuePair>
+<name>windowAnnotations</name>
+<value>
+<!--  In some instances LookupWindowAnnotation is missing tokens , but 
+Sentence can be used -->
+     <string>org.apache.ctakes.typesystem.type.textspan.Sentence</string>
+</value>
+     </nameValuePair>
+
+If the discontiguous span annotator (OverlapJCasTermAnnotator) is used, then two additional
configuration parameters are available.  These parameters are optional and modify the total
number of non-matching tokens that can be skipped and the number of consecutive non-matching
tokens that can be skipped.
+     <configurationParameter>
+     <name>consecutiveSkips</name>
+     <description>
+     Number of consecutive tokens that can be skipped when matching terms.  Default is 2
+     </description>
+     <type>Integer</type>
+     <multiValued>false</multiValued>
+     <mandatory>false</mandatory>
+     </configurationParameter>
+<configurationParameter>
+     <name>totalTokenSkips</name>
+     <description>
+     Total number of tokens that can be skipped when matching terms.  Default is 4
+     </description>
+     <type>Integer</type>
+     <multiValued>false</multiValued>
+     <mandatory>false</mandatory>
+     </configurationParameter>
+     
+     …
+     
+     <nameValuePair>
+     <name>consecutiveSkips</name>
+     <value>
+     <integer>2</integer>
+     </value>
+     </nameValuePair>
+     <nameValuePair>
+     <name>totalTokenSkips</name>
+     <value>
+     <integer>4</integer>
+     </value>
+     </nameValuePair>
+
+What follows is exactly the same as the description used for the previous dictionary lookup
module descriptor.
+ 
+     <typeSystemDescription>
+         	<imports>
+         	</imports>
+     </typeSystemDescription>
+     <typePriorities/>
+     <fsIndexCollection/>
+     <capabilities>
+     <capability>
+     <inputs>
+     <type allAnnotatorFeatures="true">
+     org.apache.ctakes.typesystem.type.syntax.BaseToken
+     </type>
+     <type allAnnotatorFeatures="true">
+     org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation
+     </type>
+     </inputs>
+     <outputs>
+     <type allAnnotatorFeatures="true">
+     org.apache.ctakes.typesystem.type.textsem.IdentifiedAnnotation
+     </type>
+     </outputs>
+     <languagesSupported/>
+</capability>
+</capabilities>
+     <operationalProperties>
+     <modifiesCas>true</modifiesCas>
+     <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
+     <outputsNewCASes>false</outputsNewCASes>
+     </operationalProperties>
+</analysisEngineMetaData>
+
+The resource manager configuration sets up the external resources used by the dictionary
lookup module.  There must be a url to a dictionary specification descriptor and a url to
each dictionary resource.  What follows is exactly as it was for the previous lookup module,
except that the external resource dependency LookupDescriptor has been replaced by DictionaryDescriptor.

+ 
+<externalResourceDependencies>
+<!— DictionaryDescriptor is the key for  the file that contains dictionary specs-->
+<externalResourceDependency>
+<key>DictionaryDescriptor</key>
+<description/>
+<interfaceName>org.apache.ctakes.core.resource.FileResource</interfaceName>
+optional>false</optional>
+</externalResourceDependency>
+      	<!-- CustomBsvDictionary is the key for the actual bsv file that contains the dictionary
-->
+     <!-- This is not a filename, just a key to connect what is here with what is in the
LookupDescriptor -->
+<externalResourceDependency>
+     <key>CustomBsvDictionary</key>
+     <description/>
+     <interfaceName>org.apache.ctakes.core.resource.FileResource</interfaceName>
+     <optional>false</optional>
+     </externalResourceDependency>
+</externalResourceDependencies>
+
+<resourceManagerConfiguration>
+     <externalResources>
+     <externalResource>
+<!-- The Binding is below, for DictionaryDescriptor = DictionaryDescriptorFile -->
+     <name>DictionaryDescriptorFile</name>
+     <description/>
+     <fileResourceSpecifier>
+     <fileUrl>
+file:org/apache/ctakes/dictionary/lookup2/example/bsv/RareWord_BSV.xml
+     </fileUrl>
+     </fileResourceSpecifier>
+     <implementationName>
+     org.apache.ctakes.core.resource.FileResourceImpl
+     </implementationName>
+     </externalResource>
+     <externalResource>
+     <!-- The Binding is below, for CustomBsvDictionary = BsvDictionaryResource -->
+     <name>BsvDictionaryResource</name>
+     <description/>
+     <fileResourceSpecifier>
+     <fileUrl>
+     file:org/apache/ctakes/dictionary/lookup2/example/bsv/CustomBsvDictionary.bsv
+     </fileUrl>
+     </fileResourceSpecifier>
+     <implementationName>
+     org.apache.ctakes.core.resource.FileResourceImpl
+     </implementationName>
+     </externalResource>
+     </externalResources>
+     <externalResourceBindings>
+     <externalResourceBinding>
+     <key>DictionaryDescriptor</key>
+     <resourceName>DictionaryDescriptorFile</resourceName>
+     </externalResourceBinding>
+     <externalResourceBinding>
+     <!-- binding key CustomBsvDictionary points up to the 
+externalResource named BsvDictionaryResource -->
+     <key>CustomBsvDictionary</key>
+     <resourceName>BsvDictionaryResource</resourceName>
+     </externalResourceBinding>
+     </externalResourceBindings>
+</resourceManagerConfiguration>
+</taeDescription>
+
+
+
+Example Dictionary Specification Descriptor XML
+
+The following Descriptor XML is in the module’s resource example directory and can be used
to perform lookup using a Bar-Separated Value flat file as the dictionary.
+
+<lookupSpecification>
+     <!--  Defines what dictionaries will be used  -->
+     <rareWordDictionaries> 
+     <!-- typeId 0 is the cTakes constant for unknown - let the consumer take care of
typing  -->
+<dictionary id="customBsv" externalResourceKey="CustomBsvDictionary" caseSensitive="false"
typeId="0">
+     <implementation>
+     <rareWordBsv/>
+         	</implementation>
+     </dictionary>
+     </rareWordDictionaries>
+
+<!-- DefaultTermConsumer will persist all spans  -->
+<rareWordConsumer className="org.apache.ctakes.dictionary.lookup2.consumer.DefaultTermConsumer">
+     <properties>
+     <property key="codingScheme" value="cTakes"/>
+     </properties>
+     </rareWordConsumer>
+</lookupSpecification>
+
+To only persist only the longest overlapping term of any given span for each semantic group,
use the PrecisionTermConsumer.
+
+<!-- PrecisionTermConsumer will only persist only the longest overlapping span of any
semantic group -->
+<rareWordConsumer className="org.apache.ctakes.dictionary.lookup2.consumer.PrecisionTermConsumer">
+     <properties>
+     <property key="codingScheme" value="cTakes"/>
+     </properties>
+</rareWordConsumer>
+
+

Added: ctakes/sandbox/ctakes-dictionary-lookup2/doc/DictionaryLookupStats.docx
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-dictionary-lookup2/doc/DictionaryLookupStats.docx?rev=1574227&view=auto
==============================================================================
Binary file - no diff available.

Propchange: ctakes/sandbox/ctakes-dictionary-lookup2/doc/DictionaryLookupStats.docx
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream



Mime
View raw message