opennlp-issues mailing list archives

From "William Colen (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OPENNLP-471) DictionaryNameFinder has HASHing issues
Date Sat, 17 Mar 2012 13:25:38 GMT

    [ https://issues.apache.org/jira/browse/OPENNLP-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231960#comment-13231960
] 

William Colen commented on OPENNLP-471:
---------------------------------------

One way I see to make it faster would be to convert it to a state machine problem. We would
need to create a new data structure from the dictionary, but it would let us know if we can
keep trying more tokens or not. We would only advance tokens if we are in a known state.
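
A minimal sketch of the state-machine idea, assuming the "new data structure" is a trie keyed
by token. TokenTrie and its methods are hypothetical names for illustration only; they are not
part of the OpenNLP API. Each trie node is a state, and we only advance to the next token
while the current state is known.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a token trie used as the state machine. Not OpenNLP code.
final class TokenTrie {

    private static final class Node {
        final Map<String, Node> next = new HashMap<>();
        boolean terminal; // true if the path from the root to this node is a full entry
    }

    private final Node root = new Node();

    /** Adds one dictionary entry, given as its sequence of tokens. */
    void add(String... entry) {
        Node node = root;
        for (String token : entry) {
            node = node.next.computeIfAbsent(token, k -> new Node());
        }
        node.terminal = true;
    }

    /** Returns the token length of the longest entry starting at tokens[start], or 0 if none. */
    int longestMatch(String[] tokens, int start) {
        Node node = root;
        int best = 0;
        for (int i = start; i < tokens.length; i++) {
            node = node.next.get(tokens[i]);
            if (node == null) {
                break; // unknown state: no entry continues with this token
            }
            if (node.terminal) {
                best = i - start + 1;
            }
        }
        return best;
    }

    /** Returns {start, endExclusive} pairs for the longest non-overlapping matches. */
    List<int[]> find(String[] tokens) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int len = longestMatch(tokens, i);
            if (len > 0) {
                spans.add(new int[] {i, i + len});
                i += len;
            } else {
                i++;
            }
        }
        return spans;
    }
}
```

With only {"folic", "acid"} in the trie, "folic" alone is a known but non-terminal state, so the
search keeps going and finds the two-token name without trying arbitrary longer windows.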

Another alternative would be to add a method to Dictionary that can do a partial match, maybe
returning an enum with the values "partial match", "complete match", and "no match". Again, we
would keep extending the token list only while we have a partial match.
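
A rough sketch of the partial-match alternative, under the assumption that the dictionary also
stores every proper prefix of its entries. MatchResult and PrefixDictionary are hypothetical
names made up for this example; the real Dictionary class has no such method as far as this
thread says.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: three-valued lookup result. Not OpenNLP code.
enum MatchResult { NO_MATCH, PARTIAL_MATCH, COMPLETE_MATCH }

final class PrefixDictionary {

    private final Set<List<String>> entries = new HashSet<>();
    private final Set<List<String>> prefixes = new HashSet<>(); // proper prefixes of entries

    /** Adds an entry and records each of its proper prefixes. */
    void add(String... entry) {
        List<String> tokens = new ArrayList<>(Arrays.asList(entry));
        entries.add(tokens);
        for (int len = 1; len < tokens.size(); len++) {
            prefixes.add(new ArrayList<>(tokens.subList(0, len)));
        }
    }

    /** Classifies a token sequence; a full entry wins over a prefix of a longer entry. */
    MatchResult match(List<String> tokens) {
        if (entries.contains(tokens)) {
            return MatchResult.COMPLETE_MATCH;
        }
        if (prefixes.contains(tokens)) {
            return MatchResult.PARTIAL_MATCH;
        }
        return MatchResult.NO_MATCH;
    }

    /** Returns the token length of the longest entry starting at start, or 0 if none. */
    int longestMatch(List<String> tokens, int start) {
        int best = 0;
        for (int end = start + 1; end <= tokens.size(); end++) {
            MatchResult result = match(tokens.subList(start, end));
            if (result == MatchResult.NO_MATCH) {
                break; // neither an entry nor a prefix: stop extending
            }
            if (result == MatchResult.COMPLETE_MATCH) {
                best = end - start;
            }
        }
        return best;
    }
}
```

The loop stops only on "no match" rather than on "complete match", so a short entry that is
also the start of a longer one still gets extended, at the cost of one extra lookup.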

But I don't really know this part of the code. Maybe none of my alternatives can be implemented
:)
                
> DictionaryNameFinder has HASHing issues
> ---------------------------------------
>
>                 Key: OPENNLP-471
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-471
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Name Finder
>            Reporter: James Kosin
>            Assignee: James Kosin
>              Labels: dictionary, namefinder
>             Fix For: tools-1.5.3
>
>
> The DictionaryNameFinder has issues finding multi-token names when the find() method
searches the dictionary a token at a time: it fails if the dictionary doesn't have a
single-token (or shorter) match for the start of the name.
> Having a dictionary with {"folic", "acid"} without an entry for {"folic"} will cause
the find() method to totally skip the fact there is a longer match possible.
> Thanks to Jim for pushing this and to my debugging skills to find.
> Three possibilities come to mind:
> 1)  One I don't really like: we turn it into a larger problem by trying longer matches
when shorter ones don't match.  Unfortunately, this quickly turns into a race to see who can
wait longer.
> 2)  A way of returning a possible match that may need exploring, or a look-ahead type
system to say we don't match "folic" but if you have "acid" after "folic" we have a match
for that in the dictionary.
> 3)  Leave it as is and modify the dictionary to add the shorter terms, maybe marking
them as not-valid entries so we know we need a longer match.

