Date: Mon, 14 Sep 2015 14:52:53 +0200
Subject: Re: How to handle big dictionaries to find typos
From: Damiano Porta <damianoporta@gmail.com>
To: dev@opennlp.apache.org
Content-Type: multipart/alternative; boundary=001a11c346383a0102051fb48b65

--001a11c346383a0102051fb48b65
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Yes Catalin, I was using DictionaryNameFinder for NER, but unfortunately it
does not support misspellings at the moment, so I have to migrate that
dictionary to a Lucene index.
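
A minimal sketch of the lookup idea discussed in this thread: keep a list of
correct names, compare each token against it with Levenshtein distance, and
tag the token when a name is within a maximum number of edits. This is a toy
in-memory stand-in and does not use Lucene's FuzzyQuery or any OpenNLP class.
The class name FuzzyDictionarySketch, the method names, and the sample company
list are all hypothetical. A linear scan like this would likely be too slow
for a dictionary of around 3M entries, which is the reason the thread points
to a Lucene index instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of OpenNLP or Lucene.
class FuzzyDictionarySketch {

    // Step 1: the "dictionary" of correct names.
    private final List<String> names;

    FuzzyDictionarySketch(List<String> correctNames) {
        this.names = new ArrayList<>(correctNames);
    }

    // Classic dynamic-programming Levenshtein distance, keeping two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Step 2: find the closest name within maxEdits (case-insensitive).
    Optional<String> lookup(String token, int maxEdits) {
        String needle = token.toLowerCase(Locale.ROOT);
        String best = null;
        int bestDistance = maxEdits + 1;
        for (String name : names) {
            int d = levenshtein(needle, name.toLowerCase(Locale.ROOT));
            if (d < bestDistance) {
                bestDistance = d;
                best = name;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        FuzzyDictionarySketch dict = new FuzzyDictionarySketch(
                Arrays.asList("Google", "Apple", "Oracle"));
        // Step 3: tag each token that matches a dictionary entry.
        for (String token : new String[] {"Gogle", "Gooogle", "Banana"}) {
            String tag = dict.lookup(token, 2)
                    .map(name -> "company (" + name + ")")
                    .orElse("no match");
            System.out.println(token + " -> " + tag);
        }
    }
}
```

The maximum of 2 edits used in main is an arbitrary choice for the example;
a smaller value gives fewer false matches on short names.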

Thank you!

2015-09-14 14:46 GMT+02:00 Cătălin M. <catalinmititelu@gmail.com>:

> Yes, you are right. You can replace DictionaryNameFinder with a Lucene
> index. When you mentioned DictionaryNameFinder I was thinking of the named
> entity recognition module (tagging being done using a NER model).
>
> Sorry for the misunderstanding.
>
> BR,
> Catalin
>
>
> On 09/14/2015 03:31 PM, Damiano Porta wrote:
>
>> Hi Catalin,
>> thank you so much for your help.
>>
>> Yes, I found Lucene's FuzzyQuery, but I did not understand one step. When
>> I check the term (with typos) against a Lucene index to find the correct
>> form, why do I have to use DictionaryNameFinder? I mean:
>>
>> 1. I can create an index with all the correct names.
>> 2. I check each token against that index to find a match or a word
>> (within a specific "distance").
>> 3. If I find something, I "tag" that word as a city without using
>> DictionaryNameFinder.
>>
>> I mean, my "dictionary" will be this Lucene index.
>> Correct?
>>
>> Thank you!
>> Damiano
>>
>>
>>
>> 2015-09-14 13:10 GMT+02:00 Cătălin M. <catalinmititelu@gmail.com>:
>>
>>> A solution might be to check typos (Gogle, Gooogle) against a Lucene index
>>> that would also contain your dictionary of companies. Using FuzzyQuery
>>> you would find the correct form => "Google" and then use this correct
>>> form in your DictionaryNameFinder.
>>>
>>> Please let me know if it seems feasible.
>>>
>>> BR,
>>> Catalin
>>>
>>>
>>>
>>> On 09/13/2015 10:35 PM, Damiano Porta wrote:
>>>
>>>> Hi Catalin,
>>>> Can I use it with DictionaryNameFinder?
>>>> Thanks
>>>> Damiano
>>>>
>>>> On Sun 13 Sep 2015 21:08, Catalin Mititelu <catalinmititelu@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Damiano,
>>>>>
>>>>> You may try Lucene's fuzzy query, which is based on Levenshtein distance.
>>>>>
>>>>> BR,
>>>>> Catalin
>>>>>
>>>>> On 09/13/2015 09:59 PM, Damiano Porta wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have created a very big dictionary of companies; it is around 3M.
>>>>>> At the moment I am using the DictionaryNameFinder class, but I need to
>>>>>> implement something to find typos like Gogle/Gooogle Inc etc.
>>>>>> I read something about Levenshtein distance; is this implemented in
>>>>>> OpenNLP?
>>>>>> It seems good, but I read that it takes a lot of time if there are many
>>>>>> words (my case).
>>>>>>
>>>>>> What should I do?
>>>>>> Thanks!
>>>>>> Damiano
>>>>>>
>>>>>>
>>>>>>
>

--001a11c346383a0102051fb48b65--