uima-user mailing list archives

From Michael Tanenblatt <sloth...@park-slope.net>
Subject Re: ConceptMapper
Date Wed, 20 Mar 2013 13:16:07 GMT
One thing that looks odd to me is that each entry has the "1. " prefix. Perhaps that is causing
a problem, if the tokenizer is putting a sentence break at that point? Just a guess.
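
To illustrate the guess, here is a minimal sketch in plain Java. It does not use OpenNLP or ConceptMapper; the class `SentenceBreakSketch`, the splitter `naiveSentences`, and the helper `fitsInOneSpan` are hypothetical names made up for this example, and the splitting rule is an assumption, not the behavior of any real sentence detector. It only shows that if a sentence break lands after "1.", an entry such as "1. FC Union Berlin" can no longer sit inside a single sentence span, while the shorter "FC Union Berlin" still can.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceBreakSketch {

    // Hypothetical, deliberately naive splitter: ends a sentence after every
    // '.', '!' or '?' that is followed by whitespace. This is an assumption
    // for illustration, not the algorithm of any real sentence detector.
    public static List<String> naiveSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // True if the whole dictionary entry occurs inside one single span.
    public static boolean fitsInOneSpan(String entry, List<String> spans) {
        for (String span : spans) {
            if (span.contains(entry)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> spans = naiveSentences("1. FC Union Berlin won today.");
        // The text is cut into two spans, right after the "1." prefix.
        System.out.println(spans);
        // The full entry straddles the break, so no single span contains it.
        System.out.println(fitsInOneSpan("1. FC Union Berlin", spans));
        // Without the prefix, the entry fits inside the second span.
        System.out.println(fitsInOneSpan("FC Union Berlin", spans));
    }
}
```

If something like this is happening, switching the data block back to the whole document, or checking where the sentence annotations actually start and end, should make it visible.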


On Mar 20, 2013, at 8:38 AM, Andreas Niekler <aniekler@informatik.uni-leipzig.de> wrote:

> This is what my dict looks like:
> 
> <?xml version="1.0" encoding="UTF-8" ?>
> <synonym>
> <token canonical="mwu" SemClass="mwu">
> <variant base="1. FC Straubing"/>
> <variant base="1. FC Styrum"/>
> <variant base="1. FC Tatran Presov"/>
> <variant base="1. FC Tatran Prešov"/>
> <variant base="1. FC Trogen"/>
> <variant base="1. FC Union"/>
> <variant base="1. FC Union Berlin"/>
> <variant base="1. FC Union Solingen"/>
> <variant base="1. FC Viersen"/>
> <variant base="1. FC Viersen 05"/>
> <variant base="1. FC Vöcklabruck"/>
> <variant base="1. FC Weißenfels"/>
> <variant base="1. FC Wernigerode"/>
> <variant base="1. FC Wilmersdorf"/>
> <variant base="1. FC Windeck"/>
> <variant base="1. FC Wolfsburg"/>
> <variant base="1. FC Wunstorf"/>
> <variant base="1. FC Zeitz"/>
> <variant base="1. FFC"/>
> <variant base="1. FFC 08 Niederkirchen"/>
> <variant base="1. FFC Fortuna Dresden-Rähnitz"/>
> <variant base="1. FFC Frankfurt"/>
> <variant base="1. FFC Montabaur"/>
> </token>
> </synonym>
> 
> Am 20.03.2013 12:26, schrieb Michael Tanenblatt:
>> I have never seen this issue--under no circumstances should anything less than the
>> full dictionary entry be matched. The only things I can think of are either errors in the
>> dictionary, though that's unlikely, or issues with the tokenizer. Or a bug… My guess is
>> that the dictionary entry, "FC Barcelona" is being tokenized such that only "FC" is annotated,
>> therefore that is the only part that needs to match. You can test if it is a tokenization
>> issue by using the sample whitespace tokenizer that comes with ConceptMapper just to test
>> and see what results you get.
>> 
>> 
>> On Mar 20, 2013, at 7:09 AM, Andreas Niekler <aniekler@informatik.uni-leipzig.de> wrote:
>> 
>>> Hello,
>>> 
>>> I am trying to use ConceptMapper to annotate multi-word units in German. The
>>> problem is that individual tokens from a dictionary entry are matched on
>>> their own, like this:
>>> 
>>> In the dict -> FC Barcelona
>>> 
>>> In the text "The FC scored today", "FC" is annotated as a DictEntry.
>>> 
>>> Why does ConceptMapper annotate this? Here are my parameters:
>>> 
>>> AnalysisEngineDescription mapper =
>>>     AnalysisEngineFactory.createPrimitiveDescription(
>>>         ConceptMapper.class,
>>>         ts,
>>>         ConceptMapper.PARAM_ANNOTATION_NAME, "org.apache.uima.conceptMapper.DictTerm",
>>>         ConceptMapper.PARAM_ENCLOSINGSPAN, "enclosingSpan",
>>>         ConceptMapper.PARAM_TOKENANNOTATION, "opennlp.uima.Token",
>>>         ConceptMapper.PARAM_ATTRIBUTE_LIST, new String[] {"canonical"},
>>>         ConceptMapper.PARAM_FEATURE_LIST, new String[] {"DictCanon"},
>>>         ConceptMapper.PARAM_MATCHEDFEATURE, "matchedText",
>>>         ConceptMapper.PARAM_TOKENIZERDESCRIPTOR, "TokenizerDE.xml",
>>>         //ConceptMapper.PARAM_DATA_BLOCK_FS, "uima.tcas.DocumentAnnotation",
>>>         ConceptMapper.PARAM_DATA_BLOCK_FS, "opennlp.uima.Sentence",
>>>         ConceptMapper.PARAM_SEARCHSTRATEGY, "ContiguousMatch",
>>>         ConceptMapper.PARAM_MATCHEDTOKENSFEATURENAME, "matchedTokens",
>>>         TokenNormalizer.PARAM_CASE_MATCH, "ignoreall");
>>> 
>>> Thank you
>>> 
>>> Andreas
>> 
>> 
> 
> -- 
> Andreas Niekler, Dipl. Ing. (FH)
> NLP Group | Department of Computer Science
> University of Leipzig
> Johannisgasse 26 | 04103 Leipzig
> 
> mail: aniekler@informatik.uni-leipzig.de

