From: "James Kosin (Commented) (JIRA)"
To: opennlp-issues@incubator.apache.org
Reply-To: issues@opennlp.apache.org
Date: Sat, 17 Mar 2012 02:24:38 +0000 (UTC)
Message-ID: <1600964740.27397.1331951078146.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <552816194.17307.1331783273585.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (OPENNLP-471) DictionaryNameFinder has HASHing issues
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

    [ https://issues.apache.org/jira/browse/OPENNLP-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231808#comment-13231808 ]

James Kosin commented on OPENNLP-471:
-------------------------------------

I've added code to expand the dictionary search as long as the built token list is smaller than the dictionary's maximum token count. This is done by adding two integer values to the Dictionary class that hold the longest and shortest token counts added to the dictionary. We currently use only the max count, but in theory it can easily be extended to use the min as well, so that we don't look for shorter dictionary entries when there are none.

The code added to find() shouldn't turn the search into an N^2 problem, but it should fix the issue of finding longer dictionary entries that have no shorter counterparts in the dictionary.

NOTE: I went with my last comment above as the solution because it currently makes the most sense. If anyone has a better idea, we can certainly entertain it in this JIRA.

Another possible option would be to bring back the Index... but we would have to make the StringList export its case sensitivity to the Index somehow so that it can work correctly. I can work on that as another option, since the Index allowed us to keep adding tokens as long as the tokens were in the Index. The problem was that the case-sensitivity issues need to be worked out there.

> DictionaryNameFinder has HASHing issues
> ---------------------------------------
>
>                 Key: OPENNLP-471
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-471
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Name Finder
>            Reporter: James Kosin
>              Labels: dictionary, namefinder
>
> The DictionaryNameFinder has issues finding multi-token names when the dictionary is searched a token at a time by the find() method, if the dictionary doesn't have a single-token (or shorter) match available.
> Having a dictionary with {"folic", "acid"} without an entry for {"folic"} will cause the find() method to completely miss the longer match that is possible.
> Thanks to Jim for pushing this and to my debugging skills for finding it.
> Three possibilities come to mind:
> 1) One I don't really like: turn it into a larger problem by trying longer matches when shorter ones don't match. Unfortunately, this quickly turns into a race to see who can wait longer.
> 2) A way of returning a possible match that may need exploring, or a look-ahead type system that says we don't match "folic", but if "acid" follows "folic" we have a match for that in the dictionary.
> 3) Leave it as is and modify the dictionary to add shorter terms to the dictionary... maybe marking them as not-a-valid entry so we know we need a longer match.
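
For anyone following along, the bounded search described in the comment above can be sketched roughly as follows. This is only an illustration under my own assumptions, not the code committed for OPENNLP-471: TokenDictionary, DictionarySearchSketch, and their methods are hypothetical names, not OpenNLP's Dictionary or DictionaryNameFinder API, and matching here is case sensitive. The sketch uses both the max and the min token count, whereas the comment says only the max is used so far.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a token-list dictionary; NOT OpenNLP's Dictionary class.
class TokenDictionary {
    private final Set<List<String>> entries = new HashSet<>();
    private int minTokenCount = Integer.MAX_VALUE; // shortest entry added so far
    private int maxTokenCount = 0;                 // longest entry added so far

    void put(String... tokens) {
        if (tokens.length == 0) {
            return; // ignore empty entries
        }
        entries.add(Arrays.asList(tokens.clone()));
        minTokenCount = Math.min(minTokenCount, tokens.length);
        maxTokenCount = Math.max(maxTokenCount, tokens.length);
    }

    boolean contains(List<String> tokens) {
        return entries.contains(tokens);
    }

    int getMinTokenCount() {
        return minTokenCount;
    }

    int getMaxTokenCount() {
        return maxTokenCount;
    }
}

// Hypothetical search routine; returns {start, end} pairs with end exclusive.
class DictionarySearchSketch {
    static List<int[]> find(TokenDictionary dict, String[] tokens) {
        List<int[]> spans = new ArrayList<>();
        List<String> all = Arrays.asList(tokens);
        int start = 0;
        while (start < tokens.length) {
            // Never try a window longer than the longest dictionary entry
            // or longer than the tokens that remain.
            int longest = Math.min(dict.getMaxTokenCount(), tokens.length - start);
            int matchedEnd = -1;
            // Try the longest window first and stop at the shortest entry length,
            // so each start position costs at most (max - min + 1) lookups.
            for (int len = longest; len >= dict.getMinTokenCount() && len >= 1; len--) {
                if (dict.contains(all.subList(start, start + len))) {
                    matchedEnd = start + len;
                    break;
                }
            }
            if (matchedEnd > 0) {
                spans.add(new int[] {start, matchedEnd});
                start = matchedEnd; // continue after the match
            } else {
                start++;
            }
        }
        return spans;
    }
}
```

With only {"folic", "acid"} in the dictionary and the tokens "take folic acid daily", this sketch reports a single span covering "folic acid", even though there is no entry for {"folic"} alone. Because the window length is bounded by the longest entry, the work grows with the number of tokens times the maximum entry length rather than with N^2.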