Date: Thu, 23 Apr 2009 14:46:25 +0400
Subject: Re: Synonym filter with support for phrases?
From: Earwin Burrfoot
To: java-dev@lucene.apache.org

> On Wed, Apr 22, 2009 at 5:12 AM, Earwin Burrfoot wrote:
>
>> Your synonyms will break if you try searching for phrases.
>> Building on your example, "food place in new york" will find nothing,
>> because 'place' and 'in' share the same position.
>
> It'd be great to get multi-word synonyms fully working...
>
> How would you change how Lucene indexes token positions to do this "correctly"?

You need the ability to put two tokens in the same position, with
different posIncrements.

One variant, off the top of my head, is to introduce a notion of span,
so a token becomes (text, span, incr):

(restaurant, 1, 0), (food, 0, 1), (place, 0, 1), (in, 0, 1), (new, 0, 1), (york, 0, 1)

The span affects the distance calculation between this term and one
that follows it.
E.g. dist(food, in) = 2, because both food and place have incr=1, but
despite restaurant and food having the same start position,
dist(restaurant, in) = 1, because restaurant spans an additional position.
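
The distance rule above can be sketched in a few lines of Java. This is only one reading of the proposal, not existing Lucene API: the `SpanDistance` and `Token` names are hypothetical, `incr` is assumed to be the advance applied after a token (so the next token's start position is the running sum of earlier `incr` values), and `span` is assumed to count the extra positions a token covers.

```java
import java.util.List;

final class SpanDistance {
    /** Hypothetical token triple (text, span, incr) as proposed in the message. */
    static final class Token {
        final String text;
        final int span; // assumption: extra positions this token covers
        final int incr; // assumption: advance applied to the NEXT token's start

        Token(String text, int span, int incr) {
            this.text = text;
            this.span = span;
            this.incr = incr;
        }
    }

    /** Start position of tokens[index]: the sum of incr over all earlier tokens. */
    static int startPosition(List<Token> tokens, int index) {
        int pos = 0;
        for (int i = 0; i < index; i++) {
            pos += tokens.get(i).incr;
        }
        return pos;
    }

    /** Distance from tokens[from] to a later tokens[to], reduced by the span of the first. */
    static int dist(List<Token> tokens, int from, int to) {
        return startPosition(tokens, to)
                - startPosition(tokens, from)
                - tokens.get(from).span;
    }

    /** The token stream from the message. */
    static List<Token> example() {
        return List.of(
                new Token("restaurant", 1, 0),
                new Token("food", 0, 1),
                new Token("place", 0, 1),
                new Token("in", 0, 1),
                new Token("new", 0, 1),
                new Token("york", 0, 1));
    }

    public static void main(String[] args) {
        List<Token> t = example();
        System.out.println(dist(t, 1, 3)); // food -> in: prints 2
        System.out.println(dist(t, 0, 3)); // restaurant -> in: prints 1
    }
}
```

Under these assumptions dist(place, in) and dist(restaurant, in) both come out as 1, so "restaurant in new york" and "food place in new york" would each match as an exact phrase.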
With something like that I think it is possible to formulate an
algorithm for indexing and query rewriting that does "correct"
multiword synonyms.

Right now I cheat when rewriting a query. If my syngroup is a part of
the phrase, and I know that this syngroup has longer phrases than the
one currently detected, I do a span or sloppy phrase query. That
works, but theoretically could match a wrong document.

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785