Date: Sun, 5 May 2013 17:21:23 +0300
Subject: Re: Best practices in boosting by proximity?
From: Gili Nachum
To: java-user@lucene.apache.org

Hi Karl,

I guess I must keep the individual terms in my query, alongside the SHOULD
phrases with slop, since I don't want to miss results even if the distance
between the terms is huge.

Slop - I will enrich the phrases with it.
Shingles - Good idea. I'll index bi-grams if performance becomes an issue.

Indeed, I've used query parser syntax, but that's just for communication's
sake; I'll probably implement this programmatically.

Cheers! (even while replying to the Lucene group ;).

On Sat, May 4, 2013 at 9:51 PM, Karl Wettin wrote:

> I just realized this mail contained several incomplete sentences. I blame
> Norwegian beers. Please allow me to try it once again:
>
> The most simple solution is to make use of slop in PhraseQuery,
> SpanNearQuery, etc(?). Also consider permutations of #isInOrder() with
> alternative query boosts.
>
> Even though slop will create a greater score the closer the terms are, it
> might still in some cases (usually when combined with other subqueries)
> make sense to create a BooleanQuery that contains the same query but with a
> greater boost to a smaller slop.
>
> You could also consider using shingles (even in combination with the
> above) for matching documents where the distance between two terms is zero.
> Generally it's hard to define a best practice. It depends on the corpora
> your index represents, your queries and your needs.
>
> Given your question it looks like you're using the query parser. Try
> something like "your proximity query"~20, but consider the cost of a great
> slop.
>
> On 4 May 2013 at 20:41, Karl Wettin wrote:
>
> > The most simple solution is to use of slop in PhraseQuery,
> > SpanNearQuery, etc(?). Also consider permutations of #isInOrder() with
> > alternative query boosts.
> >
> > Even though slop will create a greater score the closer the terms are,
> > it might still in some cases (usually when combined with other subqueries)
> > make sense to create a BooleanQuery that contains the same query but with
> > a greater boost to a smaller slop.
> >
> > You could also consider using shingles (even in combination with above)
> > for matching documents where the distance between two terms are. Generally
> > it's hard to define a best practice. It depends on the corpora your index
> > represents, your queries and your needs.
> >
> > Given your question it looks like you're using the query parser. Try
> > something like "your proximity query"~20, but consider the cost of a great
> > slop.
> >
> >
> > karl
> >
> > On 4 May 2013 at 19:46, Gili Nachum wrote:
> >
> >> Hi. *I would like for hits that contain the search terms in proximity to
> >> each other to be ranked higher than hits in which the terms are scattered
> >> across the doc.
> >> Wondering if there's a best practice to achieve that?*
> >> I also want that all hits will contain all of the search terms (implicit
> >> AND):
> >>
> >> *Example:* when users search for: "lannisters always pay their debts", the
> >> 4 matching results should be ranked the following (for simplicity, assume
> >> equal field norms, and TF/IDF, in all hits):
> >> 1. "It is known that *Lannisters always pay their debts*"
> >> 2. "... Lannisters ... they sometimes *pay their debts* ... always with you"
> >> 3. "*Lannisters always* win ... debts ... pay tax ... their nature"
> >> 4. "Lannisters ... always ... pay ... their ... debts"
> >>
> >> The first result has all 5 terms in proximity to each other.
> >> The second has 3 terms in proximity.
> >> The third has 2 terms in proximity.
> >> The fourth has none of the terms in proximity to each other.
> >>
> >> My current AND query that ignores proximity is: +lannisters +always +pay
> >> +their +debts
> >> So if there are M terms, I was thinking that I could add M-1 SHOULD phrase
> >> queries to the original query:
> >> "lannisters always" "always pay" "pay their" "their debts".
> >>
> >> What are the pros and cons? Are there alternatives to consider?
> >> Any Lucene class that helps achieve this?
> >>
> >> Thx!