Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of gilinachum@gmail.com
 designates 209.85.219.45 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <102B71C8-CC7A-4C1D-8EFA-4192C15B00F9@kodapan.se>
References: 
 <CAKFGv-F7k_JFe9yAADzF6Z36C1OPOiYn0koeNxTswUoJ2TTjQg@mail.gmail.com>
	<E6EDFDA9-7568-4E0A-94FD-9D7225E32C1B@kodapan.se>
	<102B71C8-CC7A-4C1D-8EFA-4192C15B00F9@kodapan.se>
Date: Sun, 5 May 2013 17:21:23 +0300
Message-ID: 
 <CAKFGv-EOHGyi_Kq8-tqYqn9tNsO3WdCDdi-gkgkyWhowd0O0ag@mail.gmail.com>
Subject: Re: Best practices in boosting by proximity?
From: Gili Nachum <gilinachum@gmail.com>
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=089e011847a67a3f3c04dbf94ced

--089e011847a67a3f3c04dbf94ced
Content-Type: text/plain; charset=ISO-8859-1

Hi Karl,

I guess I must keep the individual terms in my query, alongside the SHOULD
phrases with slop, since I don't want to miss results even if the
distance between the terms is huge.
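
As a rough sketch of that query shape, the plain-Java helper below builds a
query parser string with every term required plus one sloppy SHOULD phrase
per adjacent pair of terms. The class and method names are made up for
illustration, the slop value is arbitrary, and the terms are assumed to
contain no query parser special characters (nothing is escaped here).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: it only assembles a query string.
final class ProximityQueryString {

    // Produces: +t1 +t2 ... +tM "t1 t2"~slop ... "tM-1 tM"~slop
    // The "+" clauses keep the implicit AND; the quoted phrases are optional
    // (SHOULD) clauses that only affect ranking.
    static String build(List<String> terms, int slop) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms) {
            clauses.add("+" + term);
        }
        for (int i = 0; i + 1 < terms.size(); i++) {
            String phrase = terms.get(i) + " " + terms.get(i + 1);
            clauses.add("\"" + phrase + "\"~" + slop);
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        List<String> terms =
                Arrays.asList("lannisters", "always", "pay", "their", "debts");
        System.out.println(build(terms, 2));
    }
}
```

With M terms this yields M required clauses and M-1 phrase clauses, so a
document with widely scattered terms still matches and simply earns no
proximity bonus.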

Slop - I'll add it to the phrases.
Shingles - Good idea. I'll index bi-grams if performance becomes an issue.
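
To illustrate what bi-gram indexing would add, here is a minimal plain-Java
sketch that turns a token list into word bi-grams. Lucene's analysis module
has a ShingleFilter for doing this inside an analyzer; the class below is
only a hypothetical stand-in that shows the resulting tokens, not that
filter's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of word bi-grams (shingles of size 2).
final class Bigrams {

    // Joins each token with its successor, separated by a single space.
    // A list with fewer than two tokens produces no bi-grams.
    static List<String> of(List<String> tokens) {
        List<String> shingles = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            shingles.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return shingles;
    }

    public static void main(String[] args) {
        System.out.println(of(Arrays.asList("lannisters", "always", "pay")));
    }
}
```

Matching an adjacent pair then becomes a single term lookup on the bi-gram
instead of a positional phrase match, at the cost of a larger index.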

Indeed, I used query parser syntax, but that was just for the sake of
communication; I'll probably implement this programmatically.

Cheers! (even while responding to the Lucene group ;)


On Sat, May 4, 2013 at 9:51 PM, Karl Wettin <karl.wettin@kodapan.se> wrote:

> I just realized this mail contained several incomplete sentences. I blame
> Norwegian beers. Please allow me to try once again:
>
> The simplest solution is to make use of slop in PhraseQuery,
> SpanNearQuery, etc(?). Also consider permutations of #isInOrder() with
> alternative query boosts.
>
> Even though slop already yields a greater score the closer the terms are,
> it might still in some cases (usually when combined with other subqueries)
> make sense to create a BooleanQuery that contains the same query several
> times, giving a greater boost to a smaller slop.
>
> You could also consider using shingles (even in combination with the
> above) for matching documents where the distance between two terms is zero.
> Generally it's hard to define a best practice. It depends on the corpora
> your index represents, your queries and your needs.
>
> Given your question it looks like you're using the query parser. Try
> something like "your proximity query"~20, but consider the cost of a large
> slop.
>
> 4 maj 2013 kl. 20:41 skrev Karl Wettin:
>
> > The most simple solution is to use of slop in PhraseQuery,
> > SpanNearQuery, etc(?). Also consider permutations of #isInOrder() with
> > alternative query boosts.
> >
> > Even though slop will create a greater score the closer the terms are,
> > it might still in some cases (usually when combined with other subqueries)
> > make sense to create a BooleanQuery that contains the same query but with
> > a greater boost to a smaller slop.
> >
> > You could also consider using shingles (even in combination with above)
> > for matching documents where the distance between two terms are. Generally
> > it's hard to define a best practice. It depends on the corpora your index
> > represents, your queries and your needs.
> >
> > Given your question it looks like you're using the query parser. Try
> > something like "your proximity query"~20, but consider the cost of a great
> > slop.
> >
> >
> >               karl
> >
> > 4 maj 2013 kl. 19:46 skrev Gili Nachum:
> >
> >> Hi. *I would like for hits that contain the search terms in proximity to
> >> each other to be ranked higher than hits in which the terms are scattered
> >> across the doc.
> >> Wondering if there's a best practice to achieve that?*
> >> I also want all hits to contain all of the search terms (implicit
> >> AND):
> >>
> >> *Example:* when users search for: "lannisters always pay their debts", the
> >> 4 matching results should be ranked as follows (for simplicity, assume
> >> equal field norms, and TF/IDF, in all hits):
> >> 1. "It is known that *Lannisters always pay their debts*"
> >> 2. "... Lannisters ... they sometimes *pay their debts* ... always with you"
> >> 3. "*Lannisters always* win ... debts ... pay tax ... their nature"
> >> 4. "Lannisters ... always ... pay ... their ... debts"
> >>
> >> The first result has all 5 terms in proximity to each other.
> >> The second has 3 terms in proximity.
> >> The third has 2 terms in proximity.
> >> The fourth has none of the terms in proximity to each other.
> >>
> >> My current AND query that ignores proximity is: +lannisters +always +pay
> >> +their +debts
> >> So if there are M terms, I was thinking that I could add M-1 SHOULD phrase
> >> queries to the original query:
> >> "lannisters always" "always pay" "pay their" "their debts".
> >>
> >> What are the pros and cons? Are there alternatives to consider?
> >> Any Lucene class that helps achieve this?
> >>
> >> Thx!
> >
>

--089e011847a67a3f3c04dbf94ced--