From: Andreas Niekler
Date: Wed, 20 Mar 2013 13:38:48 +0100
To: user@uima.apache.org
Subject: Re: ConceptMApper
Message-ID: <5149ADD8.5040908@informatik.uni-leipzig.de>

This is what my dict looks like:

On 20.03.2013 12:26, Michael Tanenblatt wrote:
> I have never seen this issue--under no circumstances should anything less
> than the full dictionary entry be matched. The only things I can think of
> are either errors in the dictionary, though that's unlikely, or issues with
> the tokenizer. Or a bug… My guess is that the dictionary entry,
> "FC Barcelona", is being tokenized such that only "FC" is annotated, and
> therefore that is the only part that needs to match. You can check whether
> it is a tokenization issue by running the sample whitespace tokenizer that
> comes with ConceptMapper and seeing what results you get.
>
>
> On Mar 20, 2013, at 7:09 AM, Andreas Niekler wrote:
>
>> Hello,
>>
>> I am trying to use ConceptMapper to annotate multi-word units in German.
>> The problem is that single tokens of a dictionary entry are matched on
>> their own. For example:
>>
>> In the dict -> FC Barcelona
>>
>> Annotated in a text "The FC scored today": FC is annotated as DictEntry
>>
>> Why does ConceptMapper annotate this? Here are my parameters:
>>
>> AnalysisEngineDescription mapper =
>>     AnalysisEngineFactory.createPrimitiveDescription(
>>         ConceptMapper.class,
>>         ts,
>>         ConceptMapper.PARAM_ANNOTATION_NAME,
>>             "org.apache.uima.conceptMapper.DictTerm",
>>         ConceptMapper.PARAM_ENCLOSINGSPAN, "enclosingSpan",
>>         ConceptMapper.PARAM_TOKENANNOTATION, "opennlp.uima.Token",
>>         ConceptMapper.PARAM_ATTRIBUTE_LIST, new String[] {"canonical"},
>>         ConceptMapper.PARAM_FEATURE_LIST, new String[] {"DictCanon"},
>>         ConceptMapper.PARAM_MATCHEDFEATURE, "matchedText",
>>         ConceptMapper.PARAM_TOKENIZERDESCRIPTOR, "TokenizerDE.xml",
>>         //ConceptMapper.PARAM_DATA_BLOCK_FS, "uima.tcas.DocumentAnnotation",
>>         ConceptMapper.PARAM_DATA_BLOCK_FS, "opennlp.uima.Sentence",
>>         ConceptMapper.PARAM_SEARCHSTRATEGY, "ContiguousMatch",
>>         ConceptMapper.PARAM_MATCHEDTOKENSFEATURENAME, "matchedTokens",
>>         TokenNormalizer.PARAM_CASE_MATCH, "ignoreall");
>>
>> Thank you
>>
>> Andreas
>
>

--
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: aniekler@informatik.uni-leipzig.de
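
For anyone reading this thread later: the tokenization hypothesis quoted above can be pictured with a small toy program. This is only a sketch under stated assumptions, not ConceptMapper's actual code. The class DictTokenizationSketch and the methods whitespaceTokens, firstTokenOnly and matches are hypothetical names made up for illustration, and firstTokenOnly merely stands in for any dictionary tokenizer that fails to annotate the second token of an entry. Matching here is a plain contiguous, case-insensitive comparison of token lists.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Toy model (hypothetical, not part of UIMA): shows how the way a
// dictionary entry is tokenized decides what has to match in the text.
class DictTokenizationSketch {

    // Plain whitespace tokenizer, lowercased to mimic case-insensitive matching.
    static List<String> whitespaceTokens(String s) {
        List<String> out = new ArrayList<>();
        for (String t : s.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // Assumed faulty tokenizer: only the first token of the entry survives.
    static List<String> firstTokenOnly(String s) {
        List<String> all = whitespaceTokens(s);
        return all.isEmpty() ? all : all.subList(0, 1);
    }

    // Contiguous match: every token of the entry must occur in the text,
    // adjacent and in the same order.
    static boolean matches(String entry, String text,
                           Function<String, List<String>> dictTokenizer) {
        List<String> entryTokens = dictTokenizer.apply(entry);
        if (entryTokens.isEmpty()) {
            return false;
        }
        List<String> textTokens = whitespaceTokens(text);
        return Collections.indexOfSubList(textTokens, entryTokens) >= 0;
    }

    public static void main(String[] args) {
        String text = "The FC scored today";
        // Entry tokenized as [fc, barcelona]: no match in the text.
        System.out.println(matches("FC Barcelona", text,
                DictTokenizationSketch::whitespaceTokens)); // prints false
        // Entry reduced to [fc]: "FC" alone is enough to match.
        System.out.println(matches("FC Barcelona", text,
                DictTokenizationSketch::firstTokenOnly));   // prints true
    }
}
```

If the real dictionary tokenizer behaves like the second case, swapping in the sample whitespace tokenizer, as suggested in the quoted reply, should make the spurious "FC" match disappear.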