In-Reply-To: <49EEDA2F.4050904@cs.put.poznan.pl>
Date: Wed, 22 Apr 2009 13:12:37 +0400
Message-ID: <59b3eb370904220212l3748e0we72aebb0230fec9f@mail.gmail.com>
Subject: Re: Synonym filter with support for phrases?
From: Earwin Burrfoot
To: java-dev@lucene.apache.org

> Hello everyone,
>
> I'm looking for feedback and thoughts on the following problem (it's more of
> development than user-centered problem, hope the dev list is appropriate):
>
> - a token stream is given,
>
> - a set of "synonyms" is given, where synonyms are token sequences to be
> matched and token sequences to be added as synonyms.
>
> An example to make things clearer (apologies for lame synonyms).
> Given a set of synonyms like this:
>
> {"new", "york"} -> {
>        {"big", "apple"}},
>
> {"restaurant"}  -> {
>        {"diner"},
>        {"food", "place"},
>        {"full", "belly"}}
> }
>
> a token stream (I try to indicate positional information here):
>
> 0 | 1   | 2          | 3  | 4   | 5
> a | new | restaurant | in | new | york
>
> would be enriched to an index of (note overlapping tokens on the same
> positions):
>
> 0 | 1   | 2          | 3     | 4   | 5
> a | new | restaurant | in    | new | york
>   |     | diner      |       | big | apple
>   |     | food       | place |     |
>   |     | full       | belly |     |
>
> The point is for phrase queries to work for synonyms and for the original
> text (of course multi-word synonyms longer than the original phrase would
> overlap with the text, but this shouldn't be much of a worry).
>
> In the current Lucene's trunk there is a synonym filter, but its
> implementation is not really suitable for achieving the above. I wrote a
> token filter that implements the above functionality, but then I thought
> that synonyms would be something frequently dealt with so my questions are:
>
> a) are there any thoughts on how the above could be implemented using
> existing Lucene infrastructure (perhaps I missed something obvious),
>
> b) if (a) is not applicable, would such a token filter constitute a useful
> addition to Lucene?

Your synonyms will break if you try searching for phrases.
Building on your example, "food place in new york" will find nothing,
because 'place' and 'in' share the same position.

I've implemented multiword synonyms on my project, it works, but is
really hairy.
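
To make the position clash concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class PhrasePositionDemo and its methods are made up for illustration, and it assumes that an exact phrase query (no slop) matches only when its terms sit at consecutive positions. The per-position term sets are copied from the enriched stream in the quoted example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for an index: one set of terms per position.
// It only mimics exact (slop 0) phrase matching; it is not Lucene's implementation.
class PhrasePositionDemo {

    // Terms stacked at each position, taken from the enriched stream in the quoted example.
    static List<Set<String>> enrichedStream() {
        List<Set<String>> positions = new ArrayList<>();
        positions.add(terms("a"));                                     // position 0
        positions.add(terms("new"));                                   // position 1
        positions.add(terms("restaurant", "diner", "food", "full"));   // position 2
        positions.add(terms("in", "place", "belly"));                  // position 3
        positions.add(terms("new", "big"));                            // position 4
        positions.add(terms("york", "apple"));                         // position 5
        return positions;
    }

    static Set<String> terms(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    // True if the phrase terms occur at consecutive positions, starting anywhere.
    static boolean matches(List<Set<String>> positions, String phrase) {
        String[] words = phrase.split(" ");
        for (int start = 0; start + words.length <= positions.size(); start++) {
            boolean all = true;
            for (int i = 0; i < words.length && all; i++) {
                all = positions.get(start + i).contains(words[i]);
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> index = enrichedStream();
        String[] phrases = {"restaurant in new york", "food place", "food place in new york"};
        for (String phrase : phrases) {
            System.out.println(phrase + " -> " + matches(index, phrase));
        }
    }
}
```

Under these assumptions the first two phrases match and "food place in new york" does not: 'place' occupies position 3, so 'in' would have to be at position 4, where only 'new' and 'big' are stored. The same stacking also lets a mixed phrase such as "full place" match, even though it appears in neither the text nor the synonym set.
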
While building the index, I inject synonym group ids instead of actual
words, then I detect synonyms in queries and replace them with group
ids too. The hard part comes after that: you have to adjust
positionIncrements on syngroup id tokens with respect to the longest
synonym contained in that group, and then you have to handle
overlapping synonyms. When the query rewrite is finished, I end up with
a mixture of Term/Phrase/MultiPhrase/SpanQueries :)

A more correct approach is to index as-is and expand queries with the
actual synonym phrases instead of ids, but then queries become really
humongous if you have any decent synonym dictionary (I have 20+ phrase
groups).

-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785