Message-ID: <5149CE07.4050008@informatik.uni-leipzig.de>
Date: Wed, 20 Mar 2013 15:56:07 +0100
From: Andreas Niekler
To: user@uima.apache.org
Subject: Re: ConceptMApper
References: <514998D0.5060408@informatik.uni-leipzig.de> <37015975-4DF3-46DA-BABB-A2BAA0070626@park-slope.net> <5149ADD8.5040908@informatik.uni-leipzig.de>

I will investigate whether this is the case. I will try using only a
whitespace tokenizer. Is there any information on whether the
DictionaryAnnotator would help me more then? And if so, is it as fast as
the ConceptMapper?

Thanks for the clarification

Andreas

On 20.03.2013 14:16, Michael Tanenblatt wrote:
> One thing that looks odd to me is that each entry has the "1. " prefix.
> Perhaps that is causing a problem, if the tokenizer is putting a
> sentence break at that point? Just a guess.
>
>
> On Mar 20, 2013, at 8:38 AM, Andreas Niekler wrote:
>
>> This is what my dict looks like:
>>
>> On 20.03.2013 12:26, Michael Tanenblatt wrote:
>>> I have never seen this issue--under no circumstances should anything
>>> less than the full dictionary entry be matched. The only things I can
>>> think of are either errors in the dictionary, though that's unlikely,
>>> or issues with the tokenizer.
>>> Or a bug… My guess is that the dictionary entry, "FC Barcelona", is
>>> being tokenized such that only "FC" is annotated, therefore that is
>>> the only part that needs to match. You can test whether it is a
>>> tokenization issue by using the sample whitespace tokenizer that comes
>>> with ConceptMapper just to test and see what results you get.
>>>
>>>
>>> On Mar 20, 2013, at 7:09 AM, Andreas Niekler wrote:
>>>
>>>> Hello,
>>>>
>>>> I am trying to use the ConceptMapper to annotate multi-word units in
>>>> German. I face the problem that single tokens from the dictionary
>>>> entries are somehow matched on their own.
>>>>
>>>> In the dict -> FC Barcelona
>>>>
>>>> Annotated in a text "The FC scored today": FC is annotated as DictEntry
>>>>
>>>> Why does ConceptMapper annotate this? Here are my parameters:
>>>>
>>>> AnalysisEngineDescription mapper =
>>>>     AnalysisEngineFactory.createPrimitiveDescription(
>>>>         ConceptMapper.class,
>>>>         ts,
>>>>         ConceptMapper.PARAM_ANNOTATION_NAME,
>>>>             "org.apache.uima.conceptMapper.DictTerm",
>>>>         ConceptMapper.PARAM_ENCLOSINGSPAN, "enclosingSpan",
>>>>         ConceptMapper.PARAM_TOKENANNOTATION, "opennlp.uima.Token",
>>>>         ConceptMapper.PARAM_ATTRIBUTE_LIST, new String[] {"canonical"},
>>>>         ConceptMapper.PARAM_FEATURE_LIST, new String[] {"DictCanon"},
>>>>         ConceptMapper.PARAM_MATCHEDFEATURE, "matchedText",
>>>>         ConceptMapper.PARAM_TOKENIZERDESCRIPTOR, "TokenizerDE.xml",
>>>>         //ConceptMapper.PARAM_DATA_BLOCK_FS, "uima.tcas.DocumentAnnotation",
>>>>         ConceptMapper.PARAM_DATA_BLOCK_FS, "opennlp.uima.Sentence",
>>>>         ConceptMapper.PARAM_SEARCHSTRATEGY, "ContiguousMatch",
>>>>         ConceptMapper.PARAM_MATCHEDTOKENSFEATURENAME, "matchedTokens",
>>>>         TokenNormalizer.PARAM_CASE_MATCH, "ignoreall");
>>>>
>>>> Thank you
>>>>
>>>> Andreas
>>>
>>>
>>
>> --
>> Andreas Niekler, Dipl. Ing. (FH)
>> NLP Group | Department of Computer Science
>> University of Leipzig
>> Johannisgasse 26 | 04103 Leipzig
>>
>> mail: aniekler@informatik.uni-leipzig.deg.de
>
>

--
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.deg.de
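
For readers of the archive, the tokenization hypothesis discussed in this thread can be illustrated with a small stand-alone sketch. It does not use UIMA and is not ConceptMapper's implementation. The class TokenMatchDemo and its methods whitespaceTokens and containsContiguous are hypothetical names made up for this illustration. It assumes a plain whitespace split and case-insensitive comparison, and shows that the full entry "FC Barcelona" cannot match "The FC scored today", while an entry that was reduced to its first token does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: a simplified model of contiguous token matching,
// NOT the ConceptMapper implementation.
class TokenMatchDemo {

    // Hypothetical stand-in for a whitespace tokenizer; lower-casing
    // mimics a case-insensitive comparison.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // True if all entry tokens occur in order and without gaps
    // somewhere in the text tokens.
    static boolean containsContiguous(List<String> textTokens, List<String> entryTokens) {
        if (entryTokens.isEmpty()) {
            return false;
        }
        for (int i = 0; i + entryTokens.size() <= textTokens.size(); i++) {
            if (textTokens.subList(i, i + entryTokens.size()).equals(entryTokens)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> text = whitespaceTokens("The FC scored today");
        List<String> fullEntry = whitespaceTokens("FC Barcelona");
        // Simulates a dictionary entry of which only the first token survived
        // tokenization, as hypothesized in the thread.
        List<String> truncatedEntry = fullEntry.subList(0, 1);

        System.out.println("full entry matches: " + containsContiguous(text, fullEntry));
        System.out.println("truncated entry matches: " + containsContiguous(text, truncatedEntry));
    }
}
```

If the real dictionary entries are cut off in a similar way when the dictionary is loaded, a lone "FC" in the text would be enough for a match, which is why testing with the sample whitespace tokenizer is a reasonable first check.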