Date: Mon, 14 Sep 2015 14:52:53 +0200
Subject: Re: How to handle big dictionaries to find typos
From: Damiano Porta
To: dev@opennlp.apache.org

Yes Catalin,
I was using DictionaryNameFinder for NER, but unfortunately it does not
support misspellings at the moment, so I have to migrate that dictionary
to a Lucene index.

Thank you!

2015-09-14 14:46 GMT+02:00 Cătălin M.:

> Yes, you are right. You can replace DictionaryNameFinder with a Lucene
> index. When you mentioned DictionaryNameFinder, I was thinking of the
> named entity recognition module (tagging being done using a NER model).
>
> Sorry for the misunderstanding.
>
> BR,
> Catalin
>
>
> On 09/14/2015 03:31 PM, Damiano Porta wrote:
>
>> Hi Catalin,
>> thank you so much for your help.
>>
>> Yes, I found Lucene's FuzzyQuery, but I did not understand one step.
>> When I check a term (with typos) against a Lucene index to find the
>> correct form, why do I have to use DictionaryNameFinder? I mean:
>>
>> 1. I create an index with all the correct names.
>> 2. I check each token against that index to find an exact match or a
>>    word within a specific "distance".
>> 3. If I find something, I "tag" that word as a city without using
>>    DictionaryNameFinder.
>>
>> In other words, my "dictionary" will be this Lucene index.
>> Correct?
>>
>> Thank you!
>> Damiano
>>
>>
>> 2015-09-14 13:10 GMT+02:00 Cătălin M.:
>>
>>> A solution might be to check typos (Gogle, Gooogle) against a Lucene
>>> index that would contain your dictionary of companies, too. Using
>>> FuzzyQuery you would find the correct form => "Google" and then use
>>> this correct form in your DictionaryNameFinder.
>>>
>>> Please let me know if it seems feasible.
>>>
>>> BR,
>>> Catalin
>>>
>>>
>>> On 09/13/2015 10:35 PM, Damiano Porta wrote:
>>>
>>>> Hi Catalin,
>>>> Can I use it with DictionaryNameFinder?
>>>> Thanks
>>>> Damiano
>>>>
>>>> On Sun, 13 Sep 2015 21:08, Catalin Mititelu
>>>> <catalinmititelu@gmail.com> wrote:
>>>>
>>>>> Hi Damiano,
>>>>>
>>>>> You may try Lucene's fuzzy query, which is based on Levenshtein
>>>>> distance.
>>>>>
>>>>> BR,
>>>>> Catalin
>>>>>
>>>>> On 09/13/2015 09:59 PM, Damiano Porta wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have created a very big dictionary of companies, it is around 3M.
>>>>>> At the moment I am using the DictionaryNameFinder class, but I need
>>>>>> to implement something to find typos like Gogle/Gooogle Inc etc.
>>>>>> I read something about Levenshtein distance; is this implemented in
>>>>>> OpenNLP?
>>>>>> It seems good, but I read that it takes a lot of time if there are
>>>>>> many words (my case).
>>>>>>
>>>>>> What should I do?
>>>>>> Thanks!
>>>>>> Damiano
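
The idea discussed in this thread (look up each token in a dictionary of correct names and accept the closest entry within a small edit distance) can be illustrated with a minimal, self-contained sketch. This is not the Lucene FuzzyQuery approach itself and it uses no OpenNLP API: the class FuzzyDictionary and its methods are hypothetical names introduced only for illustration, and the linear scan shown here is an assumption that would likely be too slow for a dictionary of around 3M entries, which is the reason an index was suggested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not part of OpenNLP or Lucene): brute-force fuzzy lookup. */
final class FuzzyDictionary {

    private final List<String> names;

    FuzzyDictionary(List<String> names) {
        this.names = new ArrayList<>(names);
    }

    /** Levenshtein distance using the classic two-row dynamic programme. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning an empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into an empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the closest dictionary name within maxEdits, or null if there is none. */
    String correct(String token, int maxEdits) {
        String query = token.toLowerCase(Locale.ROOT);
        String best = null;
        int bestDistance = maxEdits + 1;
        for (String name : names) {
            String candidate = name.toLowerCase(Locale.ROOT);
            // Cheap filter: a length gap larger than maxEdits cannot be within range.
            if (Math.abs(candidate.length() - query.length()) > maxEdits) {
                continue;
            }
            int distance = levenshtein(query, candidate);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = name;
            }
        }
        return best;
    }
}
```

With a dictionary containing "Google", calling correct("Gogle", 2) or correct("Gooogle", 2) would return "Google", and a token with no entry within two edits would return null. In a Lucene-based setup the scan over names would be replaced by a fuzzy query against the index; Lucene's FuzzyQuery supports at most two edits, which is why maxEdits = 2 is used in the usage above.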