uima-user mailing list archives

From Michael Tanenblatt <sloth...@park-slope.net>
Subject Re: SemClass feature not working in ConceptMapper add-on
Date Mon, 21 Apr 2014 17:04:15 GMT
SemClass doesn’t need to be part of the token annotation, but it can be. Let me explain:
DictTerm annotations indicate a match found in the dictionary, and a single match
can cover multiple token annotations, so a token annotation alone is not sufficient for
indicating the match. However, ConceptMapper can (optionally) write values
back to the individual tokens of a match. So, if a match found in the dictionary has
a SemClass of “X”, ConceptMapper can be configured to set the SemClass feature of the token(s)
that were matched, in addition to the DictTerm that covers those tokens.
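
As a rough illustration of the token write-back described above, the fragment below is a minimal sketch of what the setting might look like. It assumes that TokenClassWriteBackFeatureNames is the parameter that controls the write-back, which is inferred from its description in the descriptor quoted further down ("names of features that should be written back to a token") and is not stated explicitly in this thread. It also assumes the token type (uima.tt.TokenAnnotation here) already declares a String feature named SemClass, as it does in the quoted type system.

```xml
<!-- Sketch only: goes inside <configurationParameterSettings> of the
     ConceptMapper descriptor, replacing the empty <array /> currently set
     for this parameter. Assumption: this parameter drives the write-back. -->
<nameValuePair>
  <name>TokenClassWriteBackFeatureNames</name>
  <value>
    <array>
      <!-- feature on the token annotation that should receive the value -->
      <string>SemClass</string>
    </array>
  </value>
</nameValuePair>
```

With a setting along these lines, the DictTerm annotation would still carry the match as a whole, and the tokens it covers would additionally carry the SemClass value.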

On Apr 21, 2014, at 10:28 AM, Kothuvatiparambil, Viju <viju.kothuvatiparambil@bankofamerica.com>
wrote:

> Hi Michael,
> 
> Thank you so much for your reply. I think I can follow your suggestion and get it working,
> but I still have one more question. I see that SemClass is already in the type
> system as a feature of uima.tt.TokenAnnotation (see the XML fragment below). What is the purpose
> of this? How should I decide whether a feature should be part of TokenAnnotation or DictTerm?
> 
> 
> 		<typeSystemDescription>
> 			<imports>
> 				<import name="org.apache.uima.conceptMapper.DictTerm" />
> 				<import
> 					name="org.apache.uima.conceptMapper.support.tokenizer.TokenAnnotation" />
> 			</imports>
> 			<types>
> 				<typeDescription>
> 					<name>uima.tt.TokenAnnotation</name>
> 					<description></description>
> 					<supertypeName>uima.tcas.Annotation</supertypeName>
> 					<features>
> 					
> 						<featureDescription>
> 							<name>SemClass</name>
> 							<description>
> 								semantic class of token
> 							</description>
> 							<rangeTypeName>
> 								uima.cas.String
> 							</rangeTypeName>
> 						</featureDescription>
>                     ....
> 
> Btw, this is a great framework. I can see that I will be using it a lot. I would like
to get involved in the development if you are looking for new resources.
> 
> Thanks
> Viju.
> 
> 
> 
> 
> -----Original Message-----
> From: Michael Tanenblatt [mailto:slothrop@park-slope.net] 
> Sent: Monday, April 21, 2014 6:24 AM
> To: user@uima.apache.org
> Subject: Re: SemClass feature not working in ConceptMapper add-on
> 
> You are exactly correct in your analysis: by specifying those values for AttributeList
> and FeatureList, ConceptMapper is trying to write the value of SemClass in your dictionary
> entries to your resulting annotation, which appears to be DictTerm, and DictTerm does not
> appear to have the SemClass feature as it is currently defined. The solution is to extend
> the definition of the DictTerm type to include the feature SemClass (which should be a
> String).
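
The suggested fix can be sketched as a type system fragment like the one below. This is only an illustration: the thread does not show where DictTerm is defined, so the supertype, the description text, and the placeholder comment for the existing features are assumptions. Only the SemClass name and its String range come from the discussion.

```xml
<!-- Sketch only: extend the existing DictTerm type definition with a
     SemClass feature. Supertype and descriptions are assumptions. -->
<typeDescription>
  <name>org.apache.uima.conceptMapper.DictTerm</name>
  <description>Annotation for a dictionary match</description>
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <!-- keep the features already defined for DictTerm (for example
         DictCanon, enclosingSpan, matchedText, matchedTokens) unchanged -->
    <featureDescription>
      <name>SemClass</name>
      <description>semantic class taken from the dictionary entry</description>
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>
```

Once DictTerm declares SemClass, the second entry in FeatureList has a feature to resolve against, which is what the SEVERE message complains about. If JCas classes were generated for DictTerm, they would presumably need to be regenerated after the change.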
> 
> 
> On Apr 20, 2014, at 4:10 PM, Kothuvatiparambil, Viju <viju.kothuvatiparambil@bankofamerica.com>
wrote:
> 
>> Hi All, 
>> 
>> I am trying to use the ConceptMapper add-on to assign a SemClass feature to tokens.
>> I am getting the following error:
>> 
>> SEVERE: ConceptMapper SEVERE: FeatureList[1] 'SemClass' specified, but does not exist for type: org.apache.uima.conceptMapper.DictTerm
>> 
>> I configured FeatureList and AttributeList in ConceptMapperOffsetTokenizer.xml as
given below:
>> 
>> 			<nameValuePair>
>> 				<name>AttributeList</name>
>> 				<value>
>> 					<array>
>> 						<string>canonical</string>
>> 						<string>SemClass</string>
>> 					</array>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>FeatureList</name>
>> 				<value>
>> 					<array>
>> 						<string>DictCanon</string>
>> 						<string>SemClass</string>
>> 					</array>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>ResultingAnnotationName</name>
>> 				<value>
>> 					<string>
>> 						org.apache.uima.conceptMapper.DictTerm
>> 					</string>
>> 				</value>
>> 			</nameValuePair>
>> 
>> Here is my simplified dict.xml file
>> 
>> <synonym>
>> <token canonical="grocery" SemClass="category">
>>    <variant base="grocery"/>
>> </token>
>> </synonym>
>> 
>> I debugged the problem and found that it is looking for the SemClass feature in resultAnnotationType,
>> which is DictTerm. But SemClass is not actually a feature of the DictTerm type.
>> 
>>     resultEnclosingSpan = resultAnnotationType.getFeatureByBaseName(resultEnclosingSpanName);
>>     if (resultEnclosingSpan == null) {
>>       logger.logError(PARAM_ENCLOSINGSPAN + " '" + resultEnclosingSpanName
>>               + "' specified, but does not exist for type: " + resultAnnotationType.getName());
>>       throw new AnnotatorInitializationException();
>>     }
>> 
>> I just started using UIMA, so I don't understand the complete architecture yet. Could
>> any of you point me in the right direction? Thanks a lot in advance.
>> 
>> Viju Kothuvatiparambil
>> 
>> Here is the complete ConceptMapperOffsetTokenizer.xml file contents:
>> 
>> <taeDescription xmlns="http://uima.apache.org/resourceSpecifier">
>> 	<frameworkImplementation>org.apache.uima.java</frameworkImplementation>
>> 	<primitive>true</primitive>
>> 	<annotatorImplementationName>org.apache.uima.conceptMapper.ConceptMapper</annotatorImplementationName>
>> 	<analysisEngineMetaData>
>> 		<name>ConceptMapper</name>
>> 		<description></description>
>> 		<version>1</version>
>> 		<vendor></vendor>
>> 		<configurationParameters>
>> 			<configurationParameter>
>> 				<name>caseMatch</name>
>> 				<description>
>> 					this parameter specifies the case folding mode:
>> 					ignoreall - fold everything to lowercase for
>> 					matching insensitive - fold only tokens with initial
>> 					caps to lowercase digitfold - fold all (and only)
>> 					tokens with a digit sensitive - perform no case
>> 					folding
>> 				</description>
>> 				<type>String</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>true</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>Stemmer</name>
>> 				<description>
>> 					Name of stemmer class to use before matching. MUST
>> 					have a zero-parameter constructor! If not specified,
>> 					no stemming will be performed.
>> 				</description>
>> 				<type>String</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>ResultingAnnotationName</name>
>> 				<description>
>> 					Name of the annotation type created by this TAE,
>> 					must match the typeSystemDescription entry
>> 				</description>
>> 				<type>String</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>true</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>ResultingEnclosingSpanName</name>
>> 				<description>
>> 					Name of the feature in the resultingAnnotation to
>> 					contain the span that encloses it (i.e. its
>> 					sentence)
>> 				</description>
>> 				<type>String</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>AttributeList</name>
>> 				<description>
>> 					List of attribute names for XML dictionary entry
>> 					record - must correspond to FeatureList
>> 				</description>
>> 				<type>String</type>
>> 				<multiValued>true</multiValued>
>> 				<mandatory>true</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>FeatureList</name>
>> 				<description>
>> 					List of feature names for CAS annotation - must
>> 					correspond to AttributeList
>> 				</description>
>> 				<type>String</type>
>> 				<multiValued>true</multiValued>
>> 				<mandatory>true</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>TokenAnnotation</name>
>> 				<description></description>
>> 				<type>String</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>true</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>TokenClassFeatureName</name>
>> 				<description>
>> 					Name of feature used when doing lookups against
>> 					IncludedTokenClasses and ExcludedTokenClasses
>> 				</description>
>> 				<type>String</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>TokenTextFeatureName</name>
>> 				<description></description>
>> 				<type>String</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>SpanFeatureStructure</name>
>> 				<description>
>> 					Type of annotation which corresponds to spans of
>> 					data for processing (e.g. a Sentence)
>> 				</description>
>> 				<type>String</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>true</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>OrderIndependentLookup</name>
>> 				<description>
>> 					True if should ignore element order during lookup
>> 					(i.e., "top box" would equal "box top"). Default is
>> 					False.
>> 				</description>
>> 				<type>Boolean</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>TokenTypeFeatureName</name>
>> 				<description>
>> 					Name of feature used when doing lookups against
>> 					IncludedTokenTypes and ExcludedTokenTypes
>> 				</description>
>> 				<type>String</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>IncludedTokenTypes</name>
>> 				<description>
>> 					Type of tokens to include in lookups (if not
>> 					supplied, then all types are included except those
>> 					specifically mentioned in ExcludedTokenTypes)
>> 				</description>
>> 				<type>Integer</type>
>> 				<multiValued>true</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>ExcludedTokenTypes</name>
>> 				<description></description>
>> 				<type>Integer</type>
>> 				<multiValued>true</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>ExcludedTokenClasses</name>
>> 				<description>
>> 					Class of tokens to exclude from lookups (if not
>> 					supplied, then all classes are excluded except those
>> 					specifically mentioned in IncludedTokenClasses,
>> 					unless IncludedTokenClasses is not supplied, in
>> 					which case none are excluded)
>> 				</description>
>> 				<type>String</type>
>> 				<multiValued>true</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>IncludedTokenClasses</name>
>> 				<description>
>> 					Class of tokens to include in lookups (if not
>> 					supplied, then all classes are included except those
>> 					specifically mentioned in ExcludedTokenClasses)
>> 				</description>
>> 				<type>String</type>
>> 				<multiValued>true</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>TokenClassWriteBackFeatureNames</name>
>> 				<description>
>> 					names of features that should be written back to a
>> 					token, such as a POS tag
>> 				</description>
>> 				<type>String</type>
>> 				<multiValued>true</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>ResultingAnnotationMatchedTextFeature</name>
>> 				<type>String</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>PrintDictionary</name>
>> 				<type>Boolean</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>SearchStrategy</name>
>> 				<description>
>> 					Can be either "SkipAnyMatch",
>> 					"SkipAnyMatchAllowOverlap" or
>> 					"ContiguousMatch"&#13;&#13;ContiguousMatch: longest
>> 					match of contiguous tokens within enclosing
>> 					span(taking into account included/excluded items).
>> 					DEFAULT strategy &#13;SkipAnyMatch: longest match of
>> 					not-necessarily contiguous tokens within enclosing
>> 					span (taking into account included/excluded items).
>> 					Subsequent lookups begin in span after complete
>> 					match. IMPLIES order-independent lookup
>> 					&#13;SkipAnyMatchAllowOverlap: longest match of
>> 					not-necessarily contiguous tokens within enclosing
>> 					span (taking into account included/excluded items).
>> 					Subsequent lookups begin in span after next token.
>> 					IMPLIES order-independent lookup
>> 				</description>
>> 				<type>String</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>StopWords</name>
>> 				<type>String</type>
>> 				<multiValued>true</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>FindAllMatches</name>
>> 				<type>Boolean</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>MatchedTokensFeatureName</name>
>> 				<type>String</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>ReplaceCommaWithAND</name>
>> 				<type>Boolean</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>TokenizerDescriptorPath</name>
>> 				<type>String</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>true</mandatory>
>> 			</configurationParameter>
>> 			<configurationParameter>
>> 				<name>LanguageID</name>
>> 				<type>String</type>
>> 				<multiValued>false</multiValued>
>> 				<mandatory>false</mandatory>
>> 			</configurationParameter>
>> 		</configurationParameters>
>> 		<configurationParameterSettings>
>> 			<nameValuePair>
>> 				<name>caseMatch</name>
>> 				<value>
>> 					<string>ignoreall</string>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>AttributeList</name>
>> 				<value>
>> 					<array>
>> 						<string>canonical</string>
>> 						<string>SemClass</string>
>> 					</array>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>FeatureList</name>
>> 				<value>
>> 					<array>
>> 						<string>DictCanon</string>
>> 						<string>SemClass</string>
>> 					</array>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>TokenAnnotation</name>
>> 				<value>
>> 					<string>uima.tt.TokenAnnotation</string>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>ResultingAnnotationName</name>
>> 				<value>
>> 					<string>
>> 						org.apache.uima.conceptMapper.DictTerm
>> 					</string>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>SpanFeatureStructure</name>
>> 				<value>
>> 					<string>uima.tcas.DocumentAnnotation</string>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>OrderIndependentLookup</name>
>> 				<value>
>> 					<boolean>false</boolean>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>TokenClassWriteBackFeatureNames</name>
>> 				<value>
>> 					<array />
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>IncludedTokenClasses</name>
>> 				<value>
>> 					<array />
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>PrintDictionary</name>
>> 				<value>
>> 					<boolean>false</boolean>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>FindAllMatches</name>
>> 				<value>
>> 					<boolean>false</boolean>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>StopWords</name>
>> 				<value>
>> 					<array />
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>ReplaceCommaWithAND</name>
>> 				<value>
>> 					<boolean>false</boolean>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>TokenizerDescriptorPath</name>
>> 				<value>
>> 					<string>
>> 						/search/uima/conf/descriptors/OffsetTokenizer.xml
>> 					</string>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>ResultingEnclosingSpanName</name>
>> 				<value>
>> 					<string>enclosingSpan</string>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>MatchedTokensFeatureName</name>
>> 				<value>
>> 					<string>matchedTokens</string>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>ResultingAnnotationMatchedTextFeature</name>
>> 				<value>
>> 					<string>matchedText</string>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>SearchStrategy</name>
>> 				<value>
>> 					<string>ContiguousMatch</string>
>> 				</value>
>> 			</nameValuePair>
>> 			<nameValuePair>
>> 				<name>LanguageID</name>
>> 				<value>
>> 					<string>en</string>
>> 				</value>
>> 			</nameValuePair>
>> 		</configurationParameterSettings>
>> 		<typeSystemDescription>
>> 			<imports>
>> 				<import name="org.apache.uima.conceptMapper.DictTerm" />
>> 				<import
>> 					name="org.apache.uima.conceptMapper.support.tokenizer.TokenAnnotation" />
>> 			</imports>
>> 			<types>
>> 				<typeDescription>
>> 					<name>uima.tt.TokenAnnotation</name>
>> 					<description></description>
>> 					<supertypeName>uima.tcas.Annotation</supertypeName>
>> 					<features>
>> 						<featureDescription>
>> 							<name>SemClass</name>
>> 							<description>
>> 								semantic class of token
>> 							</description>
>> 							<rangeTypeName>
>> 								uima.cas.String
>> 							</rangeTypeName>
>> 						</featureDescription>
>> 						<featureDescription>
>> 							<name>POS</name>
>> 							<description>
>> 								Part of speech of the term of which this
>> 								token is a part
>> 							</description>
>> 							<rangeTypeName>
>> 								uima.cas.String
>> 							</rangeTypeName>
>> 						</featureDescription>
>> 						<featureDescription>
>> 							<name>frost_TokenType</name>
>> 							<description></description>
>> 							<rangeTypeName>
>> 								uima.cas.Integer
>> 							</rangeTypeName>
>> 						</featureDescription>
>> 					</features>
>> 				</typeDescription>
>> 			</types>
>> 		</typeSystemDescription>
>> 		<typePriorities>
>> 			<priorityList>
>> 				<!-- <type>uima.tt.SentenceAnnotation</type> -->
>> 				<type>uima.tt.TokenAnnotation</type>
>> 			</priorityList>
>> 		</typePriorities>
>> 		<fsIndexCollection />
>> 		<capabilities>
>> 			<capability>
>> 				<inputs>
>> 					<type allAnnotatorFeatures="true">
>> 						uima.tt.TokenAnnotation
>> 					</type>
>> 					<!-- <type allAnnotatorFeatures="true">uima.tt.SentenceAnnotation</type>
>> 						<type allAnnotatorFeatures="true">uima.tt.ParagraphAnnotation</type> -->
>> 				</inputs>
>> 				<outputs>
>> 					<type allAnnotatorFeatures="true">
>> 						org.apache.uima.conceptMapper.DictTerm
>> 					</type>
>> 					<type allAnnotatorFeatures="true">
>> 						uima.tt.TokenAnnotation
>> 					</type>
>> 					<type allAnnotatorFeatures="true">
>> 						org.apache.uima.conceptMapper.support.tokenizer.TokenAnnotation
>> 					</type>
>> 					<type allAnnotatorFeatures="true">
>> 						uima.tcas.DocumentAnnotation
>> 					</type>
>> 				</outputs>
>> 				<languagesSupported />
>> 			</capability>
>> 		</capabilities>
>> 		<operationalProperties>
>> 			<modifiesCas>true</modifiesCas>
>> 			<multipleDeploymentAllowed>true</multipleDeploymentAllowed>
>> 			<outputsNewCASes>false</outputsNewCASes>
>> 		</operationalProperties>
>> 	</analysisEngineMetaData>
>> 	<externalResourceDependencies>
>> 		<externalResourceDependency>
>> 			<key>DictionaryFile</key>
>> 			<description>dictionary file loader.</description>
>> 			<interfaceName>
>> 				org.apache.uima.conceptMapper.support.dictionaryResource.DictionaryResource
>> 			</interfaceName>
>> 			<optional>false</optional>
>> 		</externalResourceDependency>
>> 	</externalResourceDependencies>
>> 	<resourceManagerConfiguration>
>> 		<externalResources>
>> 			<externalResource>
>> 				<name>DictionaryFileName</name>
>> 				<description>
>> 					A file containing the dictionary. Modify this URL to
>> 					use a different dictionary.
>> 				</description>
>> 				<fileResourceSpecifier>
>> 					<fileUrl>file:/search/uima/conf/testDict.xml</fileUrl>
>> 				</fileResourceSpecifier>
>> 				<implementationName>
>> 					org.apache.uima.conceptMapper.support.dictionaryResource.DictionaryResource_impl
>> 				</implementationName>
>> 			</externalResource>
>> 		</externalResources>
>> 		<externalResourceBindings>
>> 			<externalResourceBinding>
>> 				<key>DictionaryFile</key>
>> 				<resourceName>DictionaryFileName</resourceName>
>> 			</externalResourceBinding>
>> 		</externalResourceBindings>
>> 	</resourceManagerConfiguration>
>> </taeDescription>
>> 
>> ----------------------------------------------------------------------
>> This message, and any attachments, is for the intended recipient(s) only, may contain
information that is privileged, confidential and/or proprietary and subject to important terms
and conditions available at http://www.bankofamerica.com/emaildisclaimer.   If you are not
the intended recipient, please delete this message.
> 

