opennlp-dev mailing list archives

From Damiano Porta <damianopo...@gmail.com>
Subject Re: How to handle big dictionaries to find typos
Date Mon, 14 Sep 2015 12:52:53 GMT
Yes Catalin, I was using DictionaryNameFinder for NER. But unfortunately it
does not support misspellings at the moment. So I have to migrate that
dictionary to a Lucene index.
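
To make the idea concrete, here is a minimal sketch of the lookup step in
plain Java. It is not Lucene's API: it scans an in-memory list and computes
the Levenshtein distance for every entry, which is exactly the slow part for
a dictionary of around 3M entries and the reason for moving to an index. The
class name, method names, and the sample company list are only assumptions
for illustration. The limit of 2 edits mirrors the maximum that Lucene's
FuzzyQuery accepts.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical stand-in for an index lookup: a linear scan over a small list.
class FuzzyDictionarySketch {

    // Classic dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the closest dictionary entry within maxEdits (case-insensitive),
    // or an empty Optional when nothing is close enough.
    static Optional<String> correct(String token, List<String> dictionary, int maxEdits) {
        String needle = token.toLowerCase(Locale.ROOT);
        String best = null;
        int bestDistance = maxEdits + 1;
        for (String entry : dictionary) {
            int d = levenshtein(needle, entry.toLowerCase(Locale.ROOT));
            if (d < bestDistance) {
                bestDistance = d;
                best = entry;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        // Sample entries, made up for this example.
        List<String> companies = Arrays.asList("Google", "Apple", "Microsoft");
        for (String token : new String[] {"Gogle", "Gooogle", "Banana"}) {
            System.out.println(token + " -> "
                    + correct(token, companies, 2).orElse("(no match)"));
        }
    }
}
```

A token that resolves to a dictionary entry would then be tagged directly,
without going through DictionaryNameFinder. With a real Lucene index the
loop in correct() would be replaced by a fuzzy query against the indexed
names.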

Thank you!

2015-09-14 14:46 GMT+02:00 Cătălin M. <catalinmititelu@gmail.com>:

> Yes, you are right. You can replace DictionaryNameFinder with a Lucene
> index. When you mentioned DictionaryNameFinder I was thinking of the named
> entity recognition module (tagging being done using a NER model).
>
> Sorry for the misunderstanding.
>
> BR,
> Catalin
>
>
> On 09/14/2015 03:31 PM, Damiano Porta wrote:
>
>> Hi Catalin,
>> thank you so much for your help.
>>
>> Yes, I found Lucene's FuzzyQuery, but I did not understand one step. When
>> I check the term (with typos) against a Lucene index to find the correct
>> form, why do I have to use DictionaryNameFinder? I mean:
>>
>> 1. I can create an index with all the correct names.
>> 2. I check each token against that index to find an exact match or a word
>> within a specific "distance".
>> 3. If I find something, I "tag" that word as a city without using
>> DictionaryNameFinder.
>>
>> In other words, my "dictionary" will be this Lucene index.
>> Correct?
>>
>> Thank you!
>> Damiano
>>
>>
>>
>> 2015-09-14 13:10 GMT+02:00 Cătălin M. <catalinmititelu@gmail.com>:
>>
>>> A solution might be to check typos (Gogle, Gooogle) against a Lucene index
>>> that would also contain your dictionary of companies. Using the FuzzyQuery
>>> you would find the correct form => "Google" and then use this correct form
>>> in your DictionaryNameFinder.
>>>
>>> Please let me know if it seems feasible.
>>>
>>> BR,
>>> Catalin
>>>
>>>
>>>
>>> On 09/13/2015 10:35 PM, Damiano Porta wrote:
>>>
>>>> Hi Catalin,
>>>> Can I use it with DictionaryNameFinder?
>>>> Thanks
>>>> Damiano
>>>>
>>>> On Sun, 13 Sep 2015 at 21:08, Catalin Mititelu <catalinmititelu@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Damiano,
>>>>>
>>>>> You may try Lucene's fuzzy query, which is based on Levenshtein distance.
>>>>>
>>>>> BR,
>>>>> Catalin
>>>>>
>>>>> On 09/13/2015 09:59 PM, Damiano Porta wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have created a very big dictionary of companies, it is around 3M.
>>>>>> At the moment I am using the DictionaryNameFinder class, but I need to
>>>>>> implement something to find typos like Gogle/Gooogle Inc etc.
>>>>>> I read something about Levenshtein distance; is this implemented in
>>>>>> OpenNLP?
>>>>>> It seems good, but I read that it takes a lot of time if there are many
>>>>>> words (my case).
>>>>>>
>>>>>> What should I do?
>>>>>> Thanks!
>>>>>> Damiano
>>>>>>
>>>>>>
>>>>>>
>
